Exploratory Data Analysis

In this study, we use a dataset published on Kaggle for demonstration: https://www.kaggle.com/datasets/rajanand/key-indicators-of-annual-health-survey. This dataset was extracted from the Annual Health Survey (AHS) conducted in India from 2012 to 2013.

The original dataset includes 26 different files, each focusing on a dedicated topic (e.g. sex ratio, disability rate, etc.). Each topic consists of 2 to 10 variables, all of which are numeric values representing percentages. We are interested in “Section 16: Ante Natal Care”, “Section 17: Delivery Care”, and “Section 25: Mortality”. These three sections form our initial dataset; all other sections are unrelated to our study and are dropped.
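
A minimal sketch of how the initial dataset could be assembled, assuming each section is a separate CSV (the file names below are hypothetical) and that all files share the state and district identifier columns:

ante  <- read.csv("Section_16_Ante_Natal_Care.csv")   ## hypothetical file name
deliv <- read.csv("Section_17_Delivery_Care.csv")     ## hypothetical file name
mort  <- read.csv("Section_25_Mortality.csv")         ## hypothetical file name

## keep the identifiers plus the variables of interest, then merge district by district
keys <- c("State_Name", "State_District_Name")
df_initial <- Reduce(function(x, y) merge(x, y, by = keys),
                     list(ante[, c(keys, "Mothers_Who_Received_Any_Antenatal_Check_Up",
                                   "Mothers_Who_Had_Full_Antenatal_Check_Up",
                                   "Mothers_Who_Underwent_Ultrasound")],
                          deliv[, c(keys, "Institutional_Delivery", "Delivery_At_Home")],
                          mort[, c(keys, "Infant_Mortality_Rate")]))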

The initial dataset has 284 observations and 8 variables, listed below.

## [1] "State_Name"                                 
## [2] "State_District_Name"                        
## [3] "Mothers_Who_Received_Any_Antenatal_Check_Up"
## [4] "Mothers_Who_Had_Full_Antenatal_Check_Up"    
## [5] "Mothers_Who_Underwent_Ultrasound"           
## [6] "Institutional_Delivery"                     
## [7] "Delivery_At_Home"                           
## [8] "Infant_Mortality_Rate"

We check whether any of these eight variables contain missing values. The result shows that they are complete: every variable is filled for all 284 observations.
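
The check itself is a one-liner; a sketch, assuming the merged data frame is named df_initial as above:

colSums(is.na(df_initial))   ## number of missing values per column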

##                                  State_Name 
##                                           0 
##                         State_District_Name 
##                                           0 
## Mothers_Who_Received_Any_Antenatal_Check_Up 
##                                           0 
##     Mothers_Who_Had_Full_Antenatal_Check_Up 
##                                           0 
##            Mothers_Who_Underwent_Ultrasound 
##                                           0 
##                      Institutional_Delivery 
##                                           0 
##                            Delivery_At_Home 
##                                           0 
##                       Infant_Mortality_Rate 
##                                           0

We then generate a numerical summary of the data.
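
A sketch of the corresponding call, under the same naming assumption as above:

summary(df_initial)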

##   State_Name        State_District_Name
##  Length:284         Length:284         
##  Class :character   Class :character   
##  Mode  :character   Mode  :character   
##                                        
##                                        
##                                        
##  Mothers_Who_Received_Any_Antenatal_Check_Up
##  Min.   :59.90                              
##  1st Qu.:86.47                              
##  Median :91.70                              
##  Mean   :90.00                              
##  3rd Qu.:94.80                              
##  Max.   :99.70                              
##  Mothers_Who_Had_Full_Antenatal_Check_Up Mothers_Who_Underwent_Ultrasound
##  Min.   : 1.01                           Min.   : 8.57                   
##  1st Qu.: 6.00                           1st Qu.:23.90                   
##  Median :10.82                           Median :32.10                   
##  Mean   :13.50                           Mean   :36.60                   
##  3rd Qu.:18.34                           3rd Qu.:47.52                   
##  Max.   :54.65                           Max.   :92.50                   
##  Institutional_Delivery Delivery_At_Home Infant_Mortality_Rate
##  Min.   :23.81          Min.   : 3.95    Min.   :19.22        
##  1st Qu.:53.56          1st Qu.:19.53    1st Qu.:47.95        
##  Median :64.34          Median :34.30    Median :55.00        
##  Mean   :64.77          Mean   :34.30    Mean   :56.18        
##  3rd Qu.:80.19          3rd Qu.:45.37    3rd Qu.:65.18        
##  Max.   :95.88          Max.   :75.92    Max.   :97.00

We proceed to plot histograms of the 5 predictors and the variable of interest. Notice that, for any given district, Institutional_Delivery + Delivery_At_Home is approximately 100, i.e. P(Institutional Delivery) ≈ 1 - P(Delivery At Home).
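
A sketch of the histograms, assuming the same data frame and the six numeric columns listed above:

num_vars <- c("Mothers_Who_Received_Any_Antenatal_Check_Up",
              "Mothers_Who_Had_Full_Antenatal_Check_Up",
              "Mothers_Who_Underwent_Ultrasound",
              "Institutional_Delivery", "Delivery_At_Home",
              "Infant_Mortality_Rate")
par(mfrow = c(2, 3), family = 'serif')
for (v in num_vars) hist(df_initial[[v]], main = v, xlab = "")
par(mfrow = c(1, 1))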


Part A - Univariate Model

0. Data Ingestion

path_folder = "C:\\Users\\marcc\\Desktop\\"
file_name = "india_birth.csv"

df<- read.csv(paste0(path_folder, file_name), header = T)
df_uni <- data.frame(df$Institutional_Delivery, df$Infant_Mortality_Rate)   ## columns become df.Institutional_Delivery and df.Infant_Mortality_Rate
model <- lm(df.Infant_Mortality_Rate ~ df.Institutional_Delivery, data = df_uni)

1. Data Cleaning

1.1 Identifying influential points using Cook's Distance

We flag observations whose Cook's distance exceeds the rule-of-thumb cut-off 4/(n-2).

D   <- cooks.distance(model)
lev <- which(D > 4/(nrow(df_uni)-2))

par(family = 'serif')
plot(df_uni$df.Institutional_Delivery, D, type = "p", 
     xlab = "Institutional Delivery", ylab = "Cook's Distances", 
     main = "Cook's Distance", col = ifelse(D > 4/(nrow(df_uni)-2), "red", "blue"))
text(df_uni$df.Institutional_Delivery[D > 4/(nrow(df_uni)-2)]+0.5, D[D > 4/(nrow(df_uni)-2)], 
     labels = which(D > 4/(nrow(df_uni)-2)) )

1.2 Remove influential points and refit model

## rows flagged as influential by the Cook's distance screening above
df_uni <- df_uni[-c(79,82,83,112,141,211,265,266,272,274,281),]
model <- lm(df.Infant_Mortality_Rate ~ df.Institutional_Delivery, data = df_uni)

plot(df_uni$df.Institutional_Delivery, df_uni$df.Infant_Mortality_Rate, type = "p", pch = 20,
     xlab = "Institutional_Delivery", ylab = "Infant_Mortality_Rate", 
     main = "Institutional Delivery vs Infant Mortality Rate", col = "red")
abline(model, lwd = 2, col = "blue")

2. Diagnostic Checking

2.1 Normal Q-Q Plot (Normality Assumption)

r <- rstandard(model)
qqnorm(r)
qqline(r)

2.2 Standardized Residuals vs Predictor Plot (Linearity Assumption)

hii <- hatvalues(model)
r   <- rstandard(model)

par(family = 'serif')
plot(df_uni$df.Institutional_Delivery, r, type = "p", 
     xlab = "Institutional Delivery", ylab = "Standardized Residuals",
     main = "Standardized Residuals", col = "red", xlim = c(0, 100), ylim = c(-4, 4))
abline(h = 2, lty = 2)
abline(h = -2, lty = 2)

2.3 Observe Patterns in Variance (Homoscedasticity Assumption)

hii   <- hatvalues(model)
r     <- rstandard(model)
r_mod <- sqrt(abs(r))

best_fit <- lm(r_mod ~ df_uni$df.Institutional_Delivery)

plot(df_uni$df.Institutional_Delivery, r_mod, type = "p", 
     xlab = "Institutional Delivery", ylab = "sqrt(|Standardized Residuals|)",
     main = "Scale-Location Plot", col = "red", xlim = c(0, 100), ylim = c(-1, 3))
abline(best_fit, lwd = 2, col = "blue")

3. Analysis Process

3.1 Fitting the best line

## tidy() is from the broom package, kable() from knitr; the pipe %>% requires magrittr (or dplyr)
model %>%
  tidy() %>%
  kable(caption = "The summary from the simple linear regression model")
The summary from the simple linear regression model

term                         estimate   std.error   statistic    p.value
(Intercept)                54.3116065   3.1352501  17.3228944  0.0000000
df.Institutional_Delivery   0.0316342   0.0463868   0.6819651  0.4958435
plot(df_uni$df.Institutional_Delivery, df_uni$df.Infant_Mortality_Rate, type = "p", pch = 20,
     xlab = "Institutional_Delivery", ylab = "Infant_Mortality_Rate",
     main = "Institutional Delivery vs Infant Mortality Rate", col = "red")
abline(model, lwd = 2, col = "blue")

3.2 Create Confidence Intervals

df_uni <- df_uni[order(df_uni$df.Institutional_Delivery),]

yhat <- predict(model, newdata = df_uni, type = "response", interval = "none")
conf.inter <- predict(model, newdata = df_uni, type = "response", interval = "confidence", level = 0.95)
pred.inter <- predict(model, newdata = df_uni, type = "response", interval = "prediction", level = 0.95)

plot(df_uni$df.Institutional_Delivery, df_uni$df.Infant_Mortality_Rate, type = "p", xlab = "Institutional Delivery", ylab = "Infant Mortality Rate",
     main = "Institutional Delivery vs Infant Mortality Rate", col = "red", xlim = c(20, 100), ylim = c(20, 90))
abline(model, lwd = 2, col = "blue")
lines(df_uni$df.Institutional_Delivery, conf.inter[,2], lty = 2, col = "blue")
lines(df_uni$df.Institutional_Delivery, conf.inter[,3], lty = 2, col = "blue")
lines(df_uni$df.Institutional_Delivery, pred.inter[,2], lty = 2, col = "red")
lines(df_uni$df.Institutional_Delivery, pred.inter[,3], lty = 2, col = "red")

3.3 ANOVA Test for H0

Under H0 we rely on the usual linear-model assumptions: the errors are normally distributed (checked in 2.1), the relationship is linear so the errors have mean 0 (checked in 2.2), and the errors have constant variance (checked in 2.3).

summary(model)
## 
## Call:
## lm(formula = df.Infant_Mortality_Rate ~ df.Institutional_Delivery, 
##     data = df_uni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.381  -8.920  -1.082   8.735  31.852 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               54.31161    3.13525  17.323   <2e-16 ***
## df.Institutional_Delivery  0.03163    0.04639   0.682    0.496    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.89 on 271 degrees of freedom
## Multiple R-squared:  0.001713,   Adjusted R-squared:  -0.001971 
## F-statistic: 0.4651 on 1 and 271 DF,  p-value: 0.4958
anova(model)
qf(0.95, 1, 282)
## [1] 3.874645
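
Note that after removing the 11 influential points the refit model has 271 residual degrees of freedom (see the summary above), so strictly the critical value should be qf(0.95, 1, 271) rather than qf(0.95, 1, 282); the difference is negligible here. A sketch that reads both quantities directly from the fitted model instead of hard-coding them:

F_obs  <- anova(model)[1, "F value"]                    ## observed F statistic for the slope
F_crit <- qf(0.95, df1 = 1, df2 = df.residual(model))   ## 5% critical value with the model's residual df
F_obs > F_crit                                          ## TRUE would mean reject H0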

Since 0.4651 < 3.8746, we fail to reject H0 at the 5% level: in this simple model there is no evidence that Institutional Delivery has a linear effect on Infant Mortality Rate.


Part B - Multivariate Model

0. Data Ingestion

path_folder = "C:\\Users\\marcc\\Desktop\\"
file_name = "india_birth.csv"

df <- read.csv(paste0(path_folder, file_name), header = T)
model <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Delivery_At_Home + Mothers_Who_Received_Any_Antenatal_Check_Up 
            + Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)

1. Check Multicollinearity

1.1 Scatter Plot Matrix and VIF

pairs(~ Institutional_Delivery + Delivery_At_Home + Mothers_Who_Received_Any_Antenatal_Check_Up +
        Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)

Xmat <- model.matrix(model)
XTX  <- solve(t(Xmat)%*%Xmat)   ## (X'X)^{-1}; solve() succeeds only if X has full column rank
rankMatrix(XTX)                 ## rankMatrix() is from the Matrix package
## [1] 6
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.332268e-15
H <- Xmat%*%XTX%*%t(Xmat)       ## hat matrix
rankMatrix(H)
## [1] 6
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.306067e-14
summary(model)
## 
## Call:
## lm(formula = Infant_Mortality_Rate ~ Institutional_Delivery + 
##     Delivery_At_Home + Mothers_Who_Received_Any_Antenatal_Check_Up + 
##     Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.004  -8.993   0.303   8.591  43.748 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 99.85501   50.83905   1.964
## Institutional_Delivery                       0.32138    0.51155   0.628
## Delivery_At_Home                            -0.03446    0.51896  -0.066
## Mothers_Who_Received_Any_Antenatal_Check_Up -0.56068    0.14738  -3.804
## Mothers_Who_Had_Full_Antenatal_Check_Up     -0.11925    0.10113  -1.179
## Mothers_Who_Underwent_Ultrasound            -0.30701    0.05808  -5.286
##                                             Pr(>|t|)    
## (Intercept)                                 0.050510 .  
## Institutional_Delivery                      0.530352    
## Delivery_At_Home                            0.947109    
## Mothers_Who_Received_Any_Antenatal_Check_Up 0.000175 ***
## Mothers_Who_Had_Full_Antenatal_Check_Up     0.239334    
## Mothers_Who_Underwent_Ultrasound            2.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.72 on 278 degrees of freedom
## Multiple R-squared:  0.2002, Adjusted R-squared:  0.1858 
## F-statistic: 13.92 on 5 and 278 DF,  p-value: 3.824e-12
vif(model)  ## The vif function is from the car package
##                      Institutional_Delivery 
##                                  137.611436 
##                            Delivery_At_Home 
##                                  134.846212 
## Mothers_Who_Received_Any_Antenatal_Check_Up 
##                                    1.804854 
##     Mothers_Who_Had_Full_Antenatal_Check_Up 
##                                    1.567969 
##            Mothers_Who_Underwent_Ultrasound 
##                                    1.708771

As seen, Institutional_Delivery and Delivery_At_Home are almost perfectly (negatively) related, which is reflected in their very large VIF values. We remove “Delivery_At_Home”, since it carries essentially the same information as “Institutional_Delivery”.
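
A quick sanity check of this near-complementarity; we would expect a correlation close to -1 and row sums close to 100:

cor(df$Institutional_Delivery, df$Delivery_At_Home)
summary(df$Institutional_Delivery + df$Delivery_At_Home)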

1.2 Remove undesired variables and Repeat 1.1

model <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up 
            + Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
pairs(~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up +
        Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)

Xmat<- model.matrix(model)
XTX <- solve(t(Xmat)%*%Xmat)
rankMatrix(XTX)
## [1] 5
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15
H <- Xmat%*%XTX%*%t(Xmat)
rankMatrix(H)
## [1] 5
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.306067e-14
summary(model)
## 
## Call:
## lm(formula = Infant_Mortality_Rate ~ Institutional_Delivery + 
##     Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Had_Full_Antenatal_Check_Up + 
##     Mothers_Who_Underwent_Ultrasound, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.039  -9.017   0.316   8.594  43.743 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 96.56772   11.52976   8.376
## Institutional_Delivery                       0.35512    0.05890   6.029
## Mothers_Who_Received_Any_Antenatal_Check_Up -0.56151    0.14658  -3.831
## Mothers_Who_Had_Full_Antenatal_Check_Up     -0.11935    0.10094  -1.182
## Mothers_Who_Underwent_Ultrasound            -0.30710    0.05796  -5.299
##                                             Pr(>|t|)    
## (Intercept)                                 2.71e-15 ***
## Institutional_Delivery                      5.21e-09 ***
## Mothers_Who_Received_Any_Antenatal_Check_Up 0.000158 ***
## Mothers_Who_Had_Full_Antenatal_Check_Up     0.238066    
## Mothers_Who_Underwent_Ultrasound            2.37e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.7 on 279 degrees of freedom
## Multiple R-squared:  0.2002, Adjusted R-squared:  0.1887 
## F-statistic: 17.46 on 4 and 279 DF,  p-value: 8.489e-13
vif(model)  ## The vif function is from the car package
##                      Institutional_Delivery 
##                                    1.830999 
## Mothers_Who_Received_Any_Antenatal_Check_Up 
##                                    1.791705 
##     Mothers_Who_Had_Full_Antenatal_Check_Up 
##                                    1.567646 
##            Mothers_Who_Underwent_Ultrasound 
##                                    1.707864

As all VIF values are below 5, we conclude that multicollinearity is no longer a concern.

2. Data Cleaning

2.1 Cook’s Distance, DFFITS, DFBETAS

D <- cooks.distance(model)
which(D > qf(0.5, 5, 284-5))        ## cut-off: median of F(p, n-p) with p = 5 model coefficients
## named integer(0)
dfits <- dffits(model)
which(abs(dfits) > 2*sqrt(5/284))   ## cut-off: 2*sqrt(p/n)
##   3   9 141 155 211 261 265 266 281 
##   3   9 141 155 211 261 265 266 281
dfb <- dfbetas(model)
which(abs(dfb[,2]) > 2/sqrt(284))   ## cut-off: 2/sqrt(n), applied to the Institutional_Delivery coefficient
##   3   9  77  79  80  83  86  87 121 127 137 211 261 266 272 273 274 281 
##   3   9  77  79  80  83  86  87 121 127 137 211 261 266 272 273 274 281

We proceed to remove influential observations and refit the model.

## union of the observations flagged by DFFITS and DFBETAS above
df <- df[-c(3,9,77,79,80,83,86,87,121,127,137,141,155,211,261,265,266,272,273,274,281),]
model <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up 
            + Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)

3. Analysis Process

3.1 Normal Q-Q Plot and Standardized Residual Plot (Normality and Linearity Assumptions)

resid <- rstandard(model)
par(family = 'serif')
qqnorm(resid)
qqline(resid)

fitted <- predict(model)
plot(resid ~ fitted, type = "p", 
     xlab = "Fitted Values", ylab = "Standardized Residual", 
     cex.lab = 1.2, col = "red")
lines(lowess(fitted, resid), col = "blue")

3.2 Response vs Fitted values (Any Transformation Needed?)

par(family = 'serif')
plot(df$Infant_Mortality_Rate ~ fitted, type = "p", xlab = "Fitted Values", 
     ylab = "Infant Mortality Rate", cex.lab = 1.2,
     col = "red")
abline(lm(df$Infant_Mortality_Rate ~ fitted), lwd = 2, col = "blue")
lines(lowess(fitted, df$Infant_Mortality_Rate), col = "red")

From this plot we conclude that no transformation of the response is needed; at the very least, no clearly better transformation suggests itself.
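
One way to double-check this, assuming the car package (already used for vif above) is attached, is a Box-Cox style estimate of the response transformation; an estimated power near 1 supports leaving Infant_Mortality_Rate untransformed:

pt <- powerTransform(model)   ## powerTransform() is from the car package
summary(pt)                   ## estimated power with likelihood-ratio tests against lambda = 0 and lambda = 1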

4. ANCOVA Test

Note that this model has no indicator variable dividing the observations into groups, so no ANCOVA-style comparison is carried out.

5. Model Selection

Keep in mind that our goal is an interpretable model; prediction accuracy is of secondary importance.

5.1 Manual Investigation

criteria <- function(model){
  n <- length(model$residuals)
  p <- length(model$coefficients) - 1
  RSS <- sum(model$residuals^2)
  R2 <- summary(model)$r.squared
  R2.adj <- summary(model)$adj.r.squared
  AIC <- n*log(RSS/n) + 2*p
  AICc <- AIC + (2*(p+2)*(p+3))/(n-p-1)
  BIC <- n*log(RSS/n) + (p+2)*log(n)
  res <- c(R2, R2.adj, AIC, AICc, BIC)
  names(res) <- c("R Squared", "Adjsuted R Squared", "AIC", "AICc", "BIC")
  return(res)
}

## The criteria ##
## model 1 ##
model.full <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up 
              + Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
crit1 <- criteria(model = model.full)
## model 2 ##
model.3.1 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up + 
                  Mothers_Who_Had_Full_Antenatal_Check_Up, data = df)
crit2 <- criteria(model = model.3.1)
## model 3 ##
model.3.2 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up + 
                  Mothers_Who_Underwent_Ultrasound, data = df)
crit3 <- criteria(model = model.3.2)
## model 4 ##
model.3.3 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Had_Full_Antenatal_Check_Up + 
                   Mothers_Who_Underwent_Ultrasound, data = df)
crit4 <- criteria(model = model.3.3)
## model 5 ##
model.3.4 <- lm(Infant_Mortality_Rate ~ Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Had_Full_Antenatal_Check_Up
                 + Mothers_Who_Underwent_Ultrasound, data = df)
crit5 <- criteria(model = model.3.4)
## model 6 ##
model.2.1 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up, data = df)
crit6 <- criteria(model = model.2.1)
## model 7 ##
model.2.2 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Had_Full_Antenatal_Check_Up, data = df)
crit7 <- criteria(model = model.2.2)
## model 8 ##
model.2.3 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Underwent_Ultrasound, data = df)
crit8 <- criteria(model = model.2.3)

## Compare the criteria for each model ##
rbind(crit1, crit2, crit3, crit4, crit5, crit6, crit7, crit8)
##        R Squared Adjusted R Squared      AIC     AICc      BIC
## crit1 0.20312161         0.19076691 1265.981 1266.306 1291.413
## crit2 0.12795046         0.11784950 1287.688 1287.920 1309.549
## crit3 0.18951398         0.18012611 1268.434 1268.665 1290.294
## crit4 0.17289599         0.16331563 1273.772 1274.003 1295.632
## crit5 0.12530303         0.11517140 1288.486 1288.717 1310.346
## crit6 0.08696132         0.07993794 1297.769 1297.922 1316.057
## crit7 0.10266439         0.09576181 1293.206 1293.360 1311.495
## crit8 0.12061342         0.11384891 1287.892 1288.046 1306.181

The manual comparison is not conclusive: no reduced model clearly outperforms the full model, and every R squared value remains low. We proceed to automated selection methods in the next step.

5.2 LASSO selection

df_select = data.frame(df$Infant_Mortality_Rate,
                       df$Institutional_Delivery,
                       df$Mothers_Who_Received_Any_Antenatal_Check_Up, 
                       df$Mothers_Who_Had_Full_Antenatal_Check_Up,
                       df$Mothers_Who_Underwent_Ultrasound)
model.select <- lm(df.Infant_Mortality_Rate ~ df.Institutional_Delivery + df.Mothers_Who_Received_Any_Antenatal_Check_Up 
             + df.Mothers_Who_Had_Full_Antenatal_Check_Up + df.Mothers_Who_Underwent_Ultrasound, data = df_select)

## choose lambda by cross-validation; cv.glmnet() is from the glmnet package, alpha = 1 gives the LASSO penalty ##
set.seed(1006353848)
cv.out <- cv.glmnet(x = as.matrix(df_select[,2:5]), y = df_select$df.Infant_Mortality_Rate, standardize = T, alpha = 1)
plot(cv.out)

best.lambda <- cv.out$lambda.1se
best.lambda
## [1] 0.9303921
co<-coef(cv.out, s = "lambda.1se")

thresh <- 0.00
# select variables #
inds<-which(abs(co) > thresh )
variables<-row.names(co)[inds]
sel.var.lasso<-variables[!(variables %in% '(Intercept)')]
sel.var.lasso
## [1] "df.Institutional_Delivery"                     
## [2] "df.Mothers_Who_Received_Any_Antenatal_Check_Up"
## [3] "df.Mothers_Who_Had_Full_Antenatal_Check_Up"    
## [4] "df.Mothers_Who_Underwent_Ultrasound"

5.3 Stepwise Selection based on AIC & BIC

## Based on AIC ##
n <- nrow(df_select)
sel.var.aic <- step(model.select, trace = 0, k = 2, direction = "both") 
sel.var.aic<-attr(terms(sel.var.aic), "term.labels")   
sel.var.aic
## [1] "df.Institutional_Delivery"                     
## [2] "df.Mothers_Who_Received_Any_Antenatal_Check_Up"
## [3] "df.Mothers_Who_Had_Full_Antenatal_Check_Up"    
## [4] "df.Mothers_Who_Underwent_Ultrasound"
## Based on BIC ##
n <- nrow(df_select)
sel.var.bic <- step(model.select, trace = 0, k = log(n), direction = "both") 
sel.var.bic<-attr(terms(sel.var.bic), "term.labels")   
sel.var.bic
## [1] "df.Institutional_Delivery"                     
## [2] "df.Mothers_Who_Received_Any_Antenatal_Check_Up"
## [3] "df.Mothers_Who_Underwent_Ultrasound"

6. Model Validation

We divide the cleaned dataset into two separate subsets, one for validation and one for testing.

df_test  <- df_select[row.names(df) %in% 1:131, ]            ## first block of districts held out for testing
df_valid <- df_select[row.names(df) %in% (134):nrow(df), ]   ## remaining districts used for validation

6.1 Cross Validation and prediction performance of AIC based selection

ols.aic <- ols(df.Infant_Mortality_Rate ~ ., data = df_valid[,which(colnames(df_valid) %in% c(sel.var.aic, "df.Infant_Mortality_Rate"))], 
               x=T, y=T, model = T)   ## ols() and calibrate() are from the rms package
aic.cross <- calibrate(ols.aic, method = "crossvalidation", B = 10) ## 10 fold cross validation   
plot(aic.cross, las = 1, xlab = "Predicted Infant Mortality Rate", ylab = "Observed Infant Mortality Rate", main = "Cross-Validation calibration with AIC")

## 
## n=125   Mean absolute error=2.252   Mean squared error=6.5517
## 0.9 Quantile of absolute error=3.291
pred.aic <- predict(ols.aic, newdata = df_test[,which(colnames(df_valid) %in% c(sel.var.aic, "df.Infant_Mortality_Rate"))]) ## Test Error ##
pred.error.AIC <- mean((df_test$df.Infant_Mortality_Rate - pred.aic)^2) ## Prediction error ##

6.2 Cross Validation and prediction performance of BIC based selection

ols.bic <- ols(df.Infant_Mortality_Rate ~ ., data = df_valid[,which(colnames(df_valid) %in% c(sel.var.bic, "df.Infant_Mortality_Rate"))], 
               x=T, y=T, model = T)
bic.cross <- calibrate(ols.bic, method = "crossvalidation", B = 10) ## 10 fold cross validation   
plot(bic.cross, las = 1, xlab = "Predicted Infant Mortality Rate", ylab = "Observed Infant Mortality Rate", main = "Cross-Validation calibration with BIC")

## 
## n=125   Mean absolute error=2.546   Mean squared error=8.92306
## 0.9 Quantile of absolute error=3.518
pred.bic <- predict(ols.bic, newdata = df_test[,which(colnames(df_valid) %in% c(sel.var.bic, "df.Infant_Mortality_Rate"))]) ## Test Error ##
pred.error.BIC <- mean((df_test$df.Infant_Mortality_Rate - pred.bic)^2) ## Prediction error ##

6.3 Cross Validation and prediction performance of LASSO based selection

ols.lasso <- ols(df.Infant_Mortality_Rate ~ ., data = df_valid[,which(colnames(df_valid) %in% c(sel.var.lasso, "df.Infant_Mortality_Rate"))], 
                 x=T, y=T, model = T)

## 10 fold cross validation ##    
lasso.cross <- calibrate(ols.lasso, method = "crossvalidation", B = 10)
plot(lasso.cross, las = 1, xlab = "Predicted Infant Mortality Rate", ylab = "Observed Infant Mortality Rate", main = "Cross-Validation calibration with LASSO")

## 
## n=125   Mean absolute error=2.308   Mean squared error=7.34532
## 0.9 Quantile of absolute error=3.987
pred.lasso <- predict(ols.lasso, newdata = df_test[,which(colnames(df_valid) %in% c(sel.var.lasso, "df.Infant_Mortality_Rate"))]) ## Test Error ##
pred.error.lasso <- mean((df_test$df.Infant_Mortality_Rate - pred.lasso)^2) ## Prediction error ##

print(c(pred.error.AIC, pred.error.BIC, pred.error.lasso))
## [1] 235.162 235.162 235.162

After model selection, we decide to apply the model from the BIC-based selection. LASSO and AIC retain 4 predictors, while BIC retains 3. Since the goal of this study is to understand the relationship rather than to predict, the more parsimonious BIC-based selection is preferable.

7. Analysis Process

model.final <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + 
                    Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
summary(model.final)
## 
## Call:
## lm(formula = Infant_Mortality_Rate ~ Institutional_Delivery + 
##     Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.0296  -7.8482   0.4369   7.4197  28.8795 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 98.73380    9.59736  10.288
## Institutional_Delivery                       0.29644    0.05578   5.314
## Mothers_Who_Received_Any_Antenatal_Check_Up -0.56311    0.12001  -4.692
## Mothers_Who_Underwent_Ultrasound            -0.30168    0.05270  -5.725
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## Institutional_Delivery                      2.31e-07 ***
## Mothers_Who_Received_Any_Antenatal_Check_Up 4.38e-06 ***
## Mothers_Who_Underwent_Ultrasound            2.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.11 on 259 degrees of freedom
## Multiple R-squared:  0.1895, Adjusted R-squared:  0.1801 
## F-statistic: 20.19 on 3 and 259 DF,  p-value: 8.673e-12
anova(model.final)
qf(0.95, 1, 282)   ## note: model.final has 259 residual degrees of freedom (see the summary above)
## [1] 3.874645

In the sequential ANOVA, we again fail to reject H0 for Institutional_Delivery when it is considered on its own, as in Part A. Note, however, that its conditional t-test in the summary above is significant (p = 2.31e-07), so we cannot conclude that there is no relationship between Institutional Delivery and Infant Mortality Rate once the antenatal-care variables are taken into account. We successfully reject H0 for the other two predictors and, surprisingly, their estimates are negative, indicating that higher antenatal check-up and ultrasound coverage is associated with lower infant mortality.