STA 3024 Data Analysis 4 Due: April 9 D. Meeter, Spring, '01

The data file corn.xls was sent to me by a student in the Philippine Islands who collected it for a project. He could not get R² above 10%. There are n = 240 records, one for each cornfield.

The response is yield of corn. The predictors are Season, a dummy variable (=1 in dry season, 0 in wet season), and the following variables that are totals over the days from sowing to harvest for each particular field: rainfall in mm, degree-days over 10 °C, and solar radiation. Your objective is to obtain a statistically adequate prediction equation for corn yield using these variables. Assume that the observations are listed in order of planting time.

Plot the data!

Graph>Matrix Plot  Graph variables: Yield Rain Temp Solar

Data display: for each Group  Group variables: Season

Options  Matrix display: Upper right  OK  OK

Notice that the plots involving Rain show two groups of points (one is apparently the dry season). Then look for the '+' points that indicate the dry season, especially in Yield vs. Rain. To see what is wrong,

1. Graph>Plot  Graph variables: Yield Rain  Data display: for each Group  Group variables: Season

Assume 'Season' is defined historically, not from the present year. To find out which rows are mislabeled, keep the Yield/Rain graph active and use Editor>Brush to click on the "wrong season" points; this shows their row numbers. Also notice the Rain values for these points.

Also helpful: Graph>Time Series Plot  Graph variables: Rain. If Row 1 is the earliest planting, what do you think happened the year the data was taken?

Use Manip>Copy columns (Copy from column(s): Season; To column(s): your choice) to save the original Season variable. Now change the first entries in Season to reflect the actual rainy season, and relabel it, e.g., Seasmod. You don't need to try transformations; only Rain has a (weak) case for a log transformation.

2. Do a multiple regression using the modified Season and Rain, Temp, Solar.

Graphs  Residuals for Plots: Deleted  Residual Plots: (normal, resid. vs. fits, resid. vs. order)

Residuals vs. the variables: Rain Temp Solar (these plots are to be used in 3b.)  OK  OK
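For anyone checking the work outside Minitab, the regression and the residuals-in-row-order check can be sketched in Python. This is only an illustration under stated assumptions: the data are synthetic stand-ins generated inside the script (they are not the corn.xls values), the helper name fit_ols is made up for this sketch, and it computes ordinary residuals rather than Minitab's deleted residuals.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients, residuals, R^2."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef                           # ordinary (not deleted) residuals
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, resid, 1.0 - ss_res / ss_tot

# ASSUMPTION: synthetic stand-in data, NOT the values in corn.xls.
rng = np.random.default_rng(0)
n = 240
seasmod = rng.integers(0, 2, n).astype(float)      # hypothetical 0/1 season dummy
rain = rng.uniform(100, 800, n)
temp = rng.uniform(1200, 1800, n)
solar = rng.uniform(1500, 2500, n)
y = 2000 + 3.0 * rain + 500 * seasmod + rng.normal(0, 300, n)

X = np.column_stack([seasmod, rain, temp, solar])
coef, resid, r2 = fit_ols(X, y)

print("coefficients (const, Seasmod, Rain, Temp, Solar):", np.round(coef, 3))
print("R^2:", round(r2, 3))
# Rows are in planting order, so resid[0], resid[1], ... is the
# "residuals vs. order" sequence; plot it against 1..n to look for a trend.
```

With the real data, a curved or drifting pattern in resid plotted against row number is the same signal the Minitab residual vs. order plot gives.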

Annotate each graph to show which predictors you used: double-click within the graph, click T, click in the graph just below the subtitle, and enter text in the box. OK  Editor>View. This will reduce confusion!

a) What do you notice about the plots, especially residual vs. order? What does this say about bias in the model's predictions over the course of the growing season? Which predictors are important? Has R² changed from 10%?

b) Because of the pattern in the ordered residuals, plotting residuals vs. the predictors checks whether Rain, Temp, and Solar might have a curved relationship with Yield, instead of the linear one assumed by the model. What do you think?

Too Many Graphs? At some point you might exceed the number of allowed graphs. This can help: click an older open graph, then File>Save Graph As (enter a file name, check Save in:, then Save); after that you can X (delete) the graph.

To enter a time trend predictor variable as in the Ski Sales example in the notes,

Calc>Make Patterned Data>Simple Set of Numbers

Store patterned data in: pick a column  From first value: 1  To last value: 240  OK

Call this column Order. Since the trend looks quadratic, we also need Order squared: Calc>Calculator  Store result in variable: (you pick)  Expression: Order**2  OK. Label this column.
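The two trend columns are simple to reproduce elsewhere. A minimal Python sketch, assuming only that there are 240 rows in planting order (the variable names order and order_sq are chosen here for illustration):

```python
import numpy as np

n = 240                          # number of cornfields (rows), from the handout
order = np.arange(1, n + 1)      # 1, 2, ..., 240, same as the patterned data
order_sq = order ** 2            # the quadratic trend term, Order**2

print(order[:5], order[-1])
print(order_sq[:5], order_sq[-1])
```

Both columns are then added to the predictor list alongside Seasmod, Rain, Temp, and Solar.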

4. Redo the regression of 3. adding Order and Order**2 to the regression. Omit the Residuals vs. the variables plots used to answer 3b. Label your graphs.

What do you notice about the plots? Which predictors are important? Has R² changed?

Order is not an "explanatory variable"; it presumably stands for something else. Your plots have eliminated Rain, Temp, or Solar as the source of the quadratic trend. Can you think of any other variable that explains the order effect? (I can't, but I'm not a biologist.)

5. Reexamine the Yield/Rain plot from 1. There is a different relationship between the two variables depending on Season. Describe this relationship. Create an interaction variable.

Calc>Calculator  Store result in variable: (you pick)  Expression: select Seasmod, then * from the keypad, then Rain. You want Seasmod*Rain. This creates a variable that changes the slope in the dry season.

Run a regression adding this variable to the regression you did in 4. Omit all plots; they are similar. Have any coefficients changed in important ways? Has R² changed?
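To see why the product column changes the slope, here is a small Python sketch. Everything in it is an assumption made for illustration: the data are noise-free and generated from made-up coefficients (100, 2, 50, -3), not from corn.xls, so the fit should simply recover those numbers.

```python
import numpy as np

# ASSUMPTION: hypothetical, noise-free data with a different Rain slope per season.
rain = np.linspace(100, 800, 40)
seasmod = np.tile([0.0, 1.0], 20)            # alternate wet (0) and dry (1) rows
interaction = seasmod * rain                 # the Seasmod*Rain column

# Made-up "true" model: wet slope 2, dry slope 2 + (-3) = -1.
y = 100 + 2 * rain + 50 * seasmod - 3 * interaction

# Fit const + Rain + Seasmod + Seasmod*Rain by least squares.
A = np.column_stack([np.ones_like(rain), rain, seasmod, interaction])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

wet_slope = coef[1]                          # Rain slope when Seasmod = 0
dry_slope = coef[1] + coef[3]                # Rain slope when Seasmod = 1
print("wet-season slope:", round(wet_slope, 6))
print("dry-season slope:", round(dry_slope, 6))
```

The interaction coefficient is the difference between the dry-season and wet-season Rain slopes; the Seasmod coefficient alone only shifts the intercept.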

6. Your conclusions and recommendations to the student.

I reran the regression omitting Temp and Solar, and got the following.

Yield = 2303 + 4.54 Rain + 4316 Seasmod - 8.07 RainxSeas + 5.95 Order

I calculated the predicted yield omitting the Order variables, and plotted it along with Yield:

This illustrates how the interaction variable works to change the slope of the line.
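The slope change can also be read off the fitted equation above. A short Python sketch, using only the printed coefficients and leaving out the Order term as in the plot; the function name predicted_yield is chosen here for illustration:

```python
def predicted_yield(rain, seasmod):
    """Fitted equation from the handout with the Order term omitted."""
    return 2303 + 4.54 * rain + 4316 * seasmod - 8.07 * rain * seasmod

# Slope = change in predicted yield per 1 mm of rain, within each season.
wet_slope = predicted_yield(1, 0) - predicted_yield(0, 0)   # 4.54
dry_slope = predicted_yield(1, 1) - predicted_yield(0, 1)   # 4.54 - 8.07
dry_intercept = predicted_yield(0, 1)                       # 2303 + 4316

# Rainfall at which the wet and dry lines cross: 4316 = 8.07 * rain.
crossover_rain = 4316 / 8.07

print("wet-season slope:", round(wet_slope, 2))       # 4.54
print("dry-season slope:", round(dry_slope, 2))       # -3.53
print("dry-season intercept:", dry_intercept)         # 6619
print("lines cross at rain =", round(crossover_rain, 1), "mm")
```

So in the wet season predicted yield rises with rainfall, while in the dry season the line starts higher and slopes downward.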