Kapitel 77 Proportionen

In diesem Kapitel zeigen wir, dass man Proportionen auch mit generalisierten linearen Regressionen analysieren kann.

Wir zeigen folgende Beispiele:

Vergleich der Proportionen von zwei unabhängigen Gruppen
Vergleich der Proportionen von drei unabhängigen Gruppen
Vergleich der Proportionen von zwei abhängigen Gruppen
Vergleich der Proportionen von drei abhängigen Gruppen
Vergleich der Proportionen zweier Gruppen zu zwei Zeitpunkten

77.0.1 Wir benötigen folgendee Pakete

library(tidyverse)
library(lme4)
library(broom.mixed)
library(effects)
library(RCurl) ## to source() from Github
library(Statamarkdown)
library(qpcR)
library(boot)
library(tidymodels)
library(mosaic)

77.1 Proportionen von zwei unabhängigen Gruppen

Wir simulieren Daten zu der Proportion der Bevölkerung des Kantons Wallis und des Kantons Berns, die mindestens einmal pro Jahr gestürzt sind.

Anteil der Bevölkerung über 65 Jahren in Privathaushalten (mind. ein Sturz in den letzten 12 Monaten vor der Befragung) Wir simulieren die Daten basierend auf Informationen der Schweizer Gesundheitsobersvatorien https://ind.obsan.admin.ch/indicator/obsan/stuerze.

set.seed(666)
Proportion_Stürzer_Wallis<-sample(c(1,0), size=700, rep=TRUE, prob=c(0.253, (1-0.253)))

summarytools::freq(Proportion_Stürzer_Wallis)

Frequencies  
Proportion_Stürzer_Wallis  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    513     73.29          73.29     73.29          73.29
          1    187     26.71         100.00     26.71         100.00
       <NA>      0                               0.00         100.00
      Total    700    100.00         100.00    100.00         100.00

Proportion_Stürzer_Bern<-sample(c(1,0), size=2119, rep=TRUE, prob=c(0.271, (1-0.271)))

summarytools::freq(Proportion_Stürzer_Bern)

Frequencies  
Proportion_Stürzer_Bern  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   1534     72.39          72.39     72.39          72.39
          1    585     27.61         100.00     27.61         100.00
       <NA>      0                               0.00         100.00
      Total   2119    100.00         100.00    100.00         100.00

id<-1:max(c(length(Proportion_Stürzer_Wallis), length(Proportion_Stürzer_Bern)))

data<-data.frame(qpcR:::cbind.na(id, Proportion_Stürzer_Wallis,Proportion_Stürzer_Bern))

data<-data %>% 
  pivot_longer(cols=-id, 
               values_to="Fall", 
               names_to="Canton") %>% 
  mutate(id=1:length(id))
names(data)

[1] "id"     "Canton" "Fall"

Wir können jetzt diese Proportionen in einem einfachen generalisierten linearen Modell schätzen, indem wir eine binomiale Verteilung wählen.

bsp_1.glm<-glm(Fall~Canton, family=binomial, data=data)
summary(bsp_1.glm)


Call:
glm(formula = Fall ~ Canton, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8038  -0.8038  -0.8038   1.6044   1.6248  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -0.96402    0.04859 -19.839   <2e-16 ***
CantonProportion_Stürzer_Wallis -0.04515    0.09828  -0.459    0.646    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3309.8  on 2818  degrees of freedom
Residual deviance: 3309.6  on 2817  degrees of freedom
  (1419 observations deleted due to missingness)
AIC: 3313.6

Number of Fisher Scoring iterations: 4

jtools::summ(bsp_1.glm, confint=TRUE, exp=TRUE)

Observations	2819 (1419 missing obs. deleted)
Dependent variable	Fall
Type	Generalized linear model
Family	binomial
Link	logit

χ²(1)	0.21
Pseudo-R² (Cragg-Uhler)	0.00
Pseudo-R² (McFadden)	0.00
AIC	3313.61
BIC	3325.50

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.38	0.35	0.42	-19.84	0.00
CantonProportion_Stürzer_Wallis	0.96	0.79	1.16	-0.46	0.65
Standard errors: MLE

Proportion der Stürzer:innen im Wallis:

exp( -0.96402+-0.04515 )/(1+exp(-0.96402+-0.04515))

[1] 0.2671423

Vergleichen wir das mit den beobachteten Stürzen (Ja, stimmt):

summarytools::freq(Proportion_Stürzer_Wallis)

Frequencies  
Proportion_Stürzer_Wallis  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    513     73.29          73.29     73.29          73.29
          1    187     26.71         100.00     26.71         100.00
       <NA>      0                               0.00         100.00
      Total    700    100.00         100.00    100.00         100.00

Proportion der Stürzer:innen im Kanton Bern (stimmt auch):

summary(bsp_1.glm)


Call:
glm(formula = Fall ~ Canton, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8038  -0.8038  -0.8038   1.6044   1.6248  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -0.96402    0.04859 -19.839   <2e-16 ***
CantonProportion_Stürzer_Wallis -0.04515    0.09828  -0.459    0.646    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3309.8  on 2818  degrees of freedom
Residual deviance: 3309.6  on 2817  degrees of freedom
  (1419 observations deleted due to missingness)
AIC: 3313.6

Number of Fisher Scoring iterations: 4

exp( -0.96402)/(1+exp( -0.96402))

[1] 0.276074

summarytools::freq(Proportion_Stürzer_Bern)

Frequencies  
Proportion_Stürzer_Bern  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   1534     72.39          72.39     72.39          72.39
          1    585     27.61         100.00     27.61         100.00
       <NA>      0                               0.00         100.00
      Total   2119    100.00         100.00    100.00         100.00

Wir könnten das Modell auch ohne Intercept ausgeben lassen, was die Berechnung noch etwas vereinfacht:

bsp_1.glm_ni<-glm(Fall~Canton-1, family=binomial, data=data)
summary(bsp_1.glm_ni)


Call:
glm(formula = Fall ~ Canton - 1, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8038  -0.8038  -0.8038   1.6044   1.6248  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
CantonProportion_Stürzer_Bern   -0.96402    0.04859  -19.84   <2e-16 ***
CantonProportion_Stürzer_Wallis -1.00917    0.08542  -11.81   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3908.0  on 2819  degrees of freedom
Residual deviance: 3309.6  on 2817  degrees of freedom
  (1419 observations deleted due to missingness)
AIC: 3313.6

Number of Fisher Scoring iterations: 4

# Proportion Stürzer:innen Bern: 

exp(-0.96402)/(1+exp(-0.96402))

[1] 0.276074

# Proportion Stürzer:innen Wallis: 

exp( -1.00917)/(1+exp( -1.00917))

[1] 0.2671423

Soweit so gut; wir möchten jetzt aber testen, ob die Proportionen unterschiedlich sind. Auch das haben wir eigentlisch schon gemacht. Wir können die Differenz aus dem Regressionsmodell ablesen, allerdings alls Odds Ratio. Wir müssen nur den Koeffizienten für das Wallis exponenzieren. Wir sehen, dass die Odds Stürzer:in zu sein im Wallis 0.91 Mal kleiner ist.

Der Koeffizient für das Wallis können wir jedoch nicht direkt benutzen, da dies eben nicht der absoluten Differenz entspricht, sondern der relativen Differenz, respektive: Der Koeffizient ist die Log Odds Ratio und die Odds Ratio.

summary(bsp_1.glm)


Call:
glm(formula = Fall ~ Canton, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8038  -0.8038  -0.8038   1.6044   1.6248  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -0.96402    0.04859 -19.839   <2e-16 ***
CantonProportion_Stürzer_Wallis -0.04515    0.09828  -0.459    0.646    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3309.8  on 2818  degrees of freedom
Residual deviance: 3309.6  on 2817  degrees of freedom
  (1419 observations deleted due to missingness)
AIC: 3313.6

Number of Fisher Scoring iterations: 4

exp(-0.090698)

[1] 0.9132935

bsp_1.glm_rr<-glm(Fall~Canton, family=poisson, data=data)
summary(bsp_1.glm_rr)


Call:
glm(formula = Fall ~ Canton, family = poisson, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7431  -0.7431  -0.7431   1.0613   1.0836  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.28709    0.04134 -31.132   <2e-16 ***
CantonProportion_Stürzer_Wallis -0.03288    0.08400  -0.391    0.695    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1999.7  on 2818  degrees of freedom
Residual deviance: 1999.6  on 2817  degrees of freedom
  (1419 observations deleted due to missingness)
AIC: 3547.6

Number of Fisher Scoring iterations: 5

Die Proportion Stürzer:innen Bern:

exp(-1.28709)

[1] 0.276073

Die Proportion Stürzer:innen Wallis:

exp(-1.28709 + -0.03288 )

[1] 0.2671433

Das Berechnen der Differenz der Proportionen ist ja eigentlich kein Problem, und das Konfidenzintervall könnten wir mit folgender Formel berechnen https://www.statology.org/confidence-interval-difference-in-proportions/#:~:text=A%20confidence%20interval%20(C.I.)%20for,a%20certain%20level%20of%20confidence..)

\[\begin{equation} z \cdot({p1(1-p1)/n1 + p2(1-p2)/n2)}^2 \tag{77.1} \end{equation}\]

z = 1.96 bei 95% Konfidenzintervall
n1 = Stichprobengrösse Gruppe 1
n2 = Stichprobengrösse Gruppe 2
p1 = Proportion Gruppe 1
p2 = Proportion Gruppe 2

Also bei uns:

\[\begin{equation} 1.96 \cdot({p1(1-p1)/n1 + p2(1-p2)/n2)}^2 \tag{77.2} \end{equation}\]

Confidence interval = (p1–p2) +/- z*√(p1(1-p1)/n1 + p2(1-p2)/n2)

Eine relativ einfache Möglichkeit, ist die Berechnung eines Bootstrapped Confidence Intervals Code inspiriert von hier: https://www.r-bloggers.com/2021/06/estimating-a-risk-difference-and-confidence-intervals-using-logistic-regression/ und hier: https://www.statology.org/bootstrapping-in-r/.

Wollen wir dies als Differenz der Proportion ausdrücken, müssten wir das wie folgt berechnen

Wichtig: Das Bootstrapping wird nicht funktionieren, wenn es Missing Values in den Daten hat. Deswegen müssen wir zuerst ein kompletes Datenset erstellen.

data_complete<-data %>% 
  dplyr::filter(!is.na(Fall))

diff_function <- function(data,formula, indices) {
    glmfit <- glm(formula, data=data[indices,], family = "binomial")
    p0=exp(bsp_1.glm$coefficients[1])/(1+exp(bsp_1.glm$coefficients[1]))
    p1=exp(bsp_1.glm$coefficients[1]+bsp_1.glm$coefficients[2])/(1+exp(bsp_1.glm$coefficients[1]+bsp_1.glm$coefficients[2]))
 return(p1 - p0)
    }

prop.boot <- boot(data=data_complete, statistic=diff_function, R=2000, formula=Fall~Canton)

boot.ci(prop.boot, type = c("perc"),index = 1)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates

CALL : 
boot.ci(boot.out = prop.boot, type = c("perc"), index = 1)

Intervals : 
Level     Percentile     
95%   (-0.0089, -0.0089 )  
Calculations and Intervals on Original Scale

glmfit <- glm(Fall~Canton, data=data, family = "binomial")

  data_complete$fitted <- fitted(glmfit)

-diff(with(data_complete, tapply(fitted, Canton, mean)))

Proportion_Stürzer_Wallis 
              0.008930762

Wir können das jetzt noch mit dem Resultat des prop.test vergleichen.

prop.test(matrix(table(data$Fall, data$Canton)))

Warning in stats::prop.test(x = count, n = n, p = p, alternative =
alternative, : Chi-squared approximation may be incorrect


    1-sample proportions test with continuity correction

data:  matrix  [with success = 187]table(data$Fall, data$Canton)  [with success = 187]
X-squared = 0.25, df = 1, p-value = 0.6171
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.01319116 0.78057348
sample estimates:
   p 
0.25

Ansatz mit dem Paket margins.

library(margins)
margins(glmfit)

Average marginal effects

glm(formula = Fall ~ Canton, family = "binomial", data = data)

 CantonProportion_Stürzer_Wallis
                       -0.008931

77.2 Vergleich der Proportionen von drei unabhängigen Gruppen

https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R6_CategoricalDataAnalysis/R6_CategoricalDataAnalysis6.html und diesen Daten hier.

Wir simulieren noch neue Daten für eine dritte Gruppe. Basierend auf dieser Webseite (hier klicken) simulieren wir die Proportion der Stürzer:innen für den Kanton Appenzell Ausserrhoden.

Proportion_Stürzer_AR<-sample(c(1,0), size=11054, rep=TRUE, prob=c(0.301, (1-0.301)))
data<-data.frame(qpcR:::cbind.na(id, Proportion_Stürzer_Wallis,Proportion_Stürzer_Bern, Proportion_Stürzer_AR))
data_long<-data %>% 
  pivot_longer(cols=-id, 
               names_to="Canton", 
               values_to="Fall") %>% 
  dplyr::filter(!is.na(Fall))

77.3 Wir könnten die drei Proportion auch wieder mit dem prop.test Befehl analysieren. Verschiedene Pakete benutzen diesen Befehl und ermöglichen eine einfachere Anwendung, so dass man den Befehl direkt mit einem Datensatz individueller Daten benutzen kann. Wir benutzen hier das Paket mosaic.

Wir müssen jedoch zuerst die fehlenden Werte in der Variable Fall entfernen (diese sind beim pivot_longer entstanden, weil nicht alle Kantone gleich viele Einwohner haben.)

mosaic::prop.test(Fall~Canton,data=data_long, success=1)


    3-sample test for equality of proportions without continuity correction

data:  tally(Fall ~ Canton)
X-squared = 12.285, df = 2, p-value = 0.00215
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3 
0.3074905 0.2760736 0.2671429

77.4 Vergleich von mehr als zwei Proportionen mit einem generalisierten linearen Modell.

Wichtig ist hier unsere Nullhypothese - oft ist uns gar nicht richtig klar, was unsere Hypothese ist. Wir müssen diese jedoch ganz klar formulieren, um den richtigen Test auszuwählen. Hier ein paar Beispiele

$Stürze Wallis = Stürze Bern = Stürze Appenzell Aussenrhoden $
$\rho Stürze Appenzell Aussenrhoden = \rho Stürze Wallis$
$\rho Stürze Bern > \rho Stürze Wallis > \rho Stürze Appenzell Aussenrhoden$

glmfit <- glm(Fall~Canton, data=data_long, family = "binomial")

summary(glmfit)


Call:
glm(formula = Fall ~ Canton, family = "binomial", data = data_long)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8572  -0.8572  -0.8572   1.5358   1.6248  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -0.81188    0.02061 -39.389  < 2e-16 ***
CantonProportion_Stürzer_Bern   -0.15214    0.05278  -2.882  0.00395 ** 
CantonProportion_Stürzer_Wallis -0.19729    0.08787  -2.245  0.02476 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16964  on 13872  degrees of freedom
Residual deviance: 16952  on 13870  degrees of freedom
AIC: 16958

Number of Fisher Scoring iterations: 4

data_long$fitted <- fitted(glmfit)

-diff(with(data_long, tapply(fitted, Canton, mean)))

  Proportion_Stürzer_Bern Proportion_Stürzer_Wallis 
              0.031416882               0.008930762

margins::margins(glmfit)

Average marginal effects

glm(formula = Fall ~ Canton, family = "binomial", data = data_long)

 CantonProportion_Stürzer_Bern CantonProportion_Stürzer_Wallis
                             0                               0

77.5 Events / Total Forma

Sie werden im Internet in Beispielen zu Proportionen und glm folgendes Format finden:

data_total_events<-data_long %>% 
  group_by(Canton) %>% 
  summarise(Total=n(),
            Events=sum(Fall))

head(data_total_events)

# A tibble: 3 × 3
  Canton                    Total Events
  <chr>                     <int>  <dbl>
1 Proportion_Stürzer_AR     11054   3399
2 Proportion_Stürzer_Bern    2119    585
3 Proportion_Stürzer_Wallis   700    187

Diese Daten werden so analysiert:

out <- glm(cbind(Events, Total-Events) ~ Canton,
           family=binomial(logit), data=data_total_events)
summary(out)


Call:
glm(formula = cbind(Events, Total - Events) ~ Canton, family = binomial(logit), 
    data = data_total_events)

Deviance Residuals: 
[1]  0  0  0

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -0.81188    0.02061 -39.389  < 2e-16 ***
CantonProportion_Stürzer_Bern   -0.15214    0.05278  -2.882  0.00395 ** 
CantonProportion_Stürzer_Wallis -0.19729    0.08787  -2.245  0.02476 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  1.2460e+01  on 2  degrees of freedom
Residual deviance: -1.3816e-12  on 0  degrees of freedom
AIC: 30.248

Number of Fisher Scoring iterations: 2

Falls die Daten nicht als Total / Events gespeichert sind, sondern als Events / Non-Events ist es noch einfacher:

data_events_nonevents<-data_long %>% 
  group_by(Canton) %>% 
  summarise(Events=sum(Fall), 
            Non_Events=n()-Events)

out <- glm(cbind(Events, Non_Events) ~ Canton,
           family=binomial(logit), data=data_events_nonevents)
summary(out)


Call:
glm(formula = cbind(Events, Non_Events) ~ Canton, family = binomial(logit), 
    data = data_events_nonevents)

Deviance Residuals: 
[1]  0  0  0

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -0.81188    0.02061 -39.389  < 2e-16 ***
CantonProportion_Stürzer_Bern   -0.15214    0.05278  -2.882  0.00395 ** 
CantonProportion_Stürzer_Wallis -0.19729    0.08787  -2.245  0.02476 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  1.2460e+01  on 2  degrees of freedom
Residual deviance: -1.3816e-12  on 0  degrees of freedom
AIC: 30.248

Number of Fisher Scoring iterations: 2

Falls Sie dies in Stata tun möchten, müssen Sie wie folgt vorgehen (hier klicken).

77.5.1 Trend Analysieren

Wenn wir die Hypothese testen möchten, dass es einen Trend der Proportionen über eine bestimmte Einordnung der Kategorien gibt, können wir dies wie folgt tun (https://rdrr.io/r/stats/prop.trend.test.html).

(res.prop<-prop.trend.test(data_events_nonevents$Events, data_events_nonevents$Non_Events, score = seq_along(data_events_nonevents$Canton)))


    Chi-squared Test for Trend in Proportions

data:  data_events_nonevents$Events out of data_events_nonevents$Non_Events ,
 using scores: 1 2 3
X-squared = 28.098, df = 1, p-value = 1.153e-07

res.prop$statistic

X-squared 
 28.09771

res.prop$parameter

df 
 1

res.prop$p.value

[1] 1.153424e-07

https://stats.stackexchange.com/questions/567841/testing-trends-in-proportions-with-r-prop-trend-test-vs-rstatixprop-trend-tes.

Es gibt auch eine Funktion prop_trend_test aus dem rstatix Paket. Diese benötigt als Input die Fälle und die Nicht-Fälle.

rstatix::prop_trend_test(as.matrix(data_events_nonevents))

Error in sum(xtab): invalid 'type' (character) of argument

77.6 Proportionen schätzen: weitere Beispiele

set.seed(1245)
Proportion_HAD_Methode_A = sort(sample(c(1,0), size=5000, rep=TRUE, prob=c(0.2, 0.8)))
Proportion_HAD_Methode_B = sort(sample(c(1,0), size=5000, rep=TRUE, prob=c(0.25, 0.75)))
Proportion_HAD_Methode_C = sort(sample(c(1,0), size=5000, rep=TRUE, prob=c(0.3, 0.7)))

Proportion_HAD_Methode_A_2 = (sample(c(1,0), size=500, rep=TRUE, prob=c(0.2, 0.8)))
Proportion_HAD_Methode_B_2 = (sample(c(1,0), size=500, rep=TRUE, prob=c(0.25, 0.75)))
Proportion_HAD_Methode_C_2 = (sample(c(1,0), size=500, rep=TRUE, prob=c(0.3, 0.7)))

Proportion_HAD_Methode_A = c(Proportion_HAD_Methode_A ,Proportion_HAD_Methode_A_2)
Proportion_HAD_Methode_B = c(Proportion_HAD_Methode_B ,Proportion_HAD_Methode_B_2)
Proportion_HAD_Methode_C = c(Proportion_HAD_Methode_C ,Proportion_HAD_Methode_C_2)


id=rep(1:length(Proportion_HAD_Methode_A))
data<-data.frame(id,Proportion_HAD_Methode_A,Proportion_HAD_Methode_B, Proportion_HAD_Methode_C)
psych::pairs.panels(data)

Proportionen schätzen

# Proportion_HAD_Methode_A = (sample(c(0,1), size=1000, rep=TRUE, prob=c(0.2, 0.8)))
# Proportion_HAD_Methode_B = (sample(c(0,1), size=1000, rep=TRUE, prob=c(0.25, 0.75)))
# Proportion_HAD_Methode_C = (sample(c(0,1), size=1000, rep=TRUE, prob=c(0.3, 0.7)))
# 
# id=rep(1:100)
# data<-data.frame(id,Proportion_HAD_Methode_A,Proportion_HAD_Methode_B, Proportion_HAD_Methode_C)
# 
# psych::pairs.panels(data)

summarytools::freq(Proportion_HAD_Methode_A)

Frequencies  
Proportion_HAD_Methode_A  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   4409     80.16          80.16     80.16          80.16
          1   1091     19.84         100.00     19.84         100.00
       <NA>      0                               0.00         100.00
      Total   5500    100.00         100.00    100.00         100.00

summarytools::freq(Proportion_HAD_Methode_B)

Frequencies  
Proportion_HAD_Methode_B  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   4114     74.80          74.80     74.80          74.80
          1   1386     25.20         100.00     25.20         100.00
       <NA>      0                               0.00         100.00
      Total   5500    100.00         100.00    100.00         100.00

summarytools::freq(Proportion_HAD_Methode_C)

Frequencies  
Proportion_HAD_Methode_C  
Type: Numeric  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0   3867     70.31          70.31     70.31          70.31
          1   1633     29.69         100.00     29.69         100.00
       <NA>      0                               0.00         100.00
      Total   5500    100.00         100.00    100.00         100.00

data_long<-data %>% 
  pivot_longer(cols=-id, 
               names_to="methods", 
               values_to="HAD") %>% 
  mutate(id=factor(id))

rio::export(data_long, "data_long_proportion.dta")

Wir könnten die Proportionen jetzt mit einem glm Modell berechnen (wir nehmen als Verteilung die Binomialverteilung). Die Proportionen stimmen, nicht jedoch die Konfizenzintervalle.

fit.1<-glm(HAD~factor(methods), data=data_long, family='binomial')

summarytools::freq(data_long$HAD)

Frequencies  
data_long$HAD  
Type: Numeric  

               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------- --------- -------------- --------- --------------
          0   12390     75.09          75.09     75.09          75.09
          1    4110     24.91         100.00     24.91         100.00
       <NA>       0                               0.00         100.00
      Total   16500    100.00         100.00    100.00         100.00

(res<-summary(fit.1))


Call:
glm(formula = HAD ~ factor(methods), family = "binomial", data = data_long)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8394  -0.7620  -0.6650  -0.6650   1.7987  

Coefficients:
                                        Estimate Std. Error z value Pr(>|z|)
(Intercept)                             -1.39655    0.03381 -41.301  < 2e-16
factor(methods)Proportion_HAD_Methode_B  0.30858    0.04591   6.721  1.8e-11
factor(methods)Proportion_HAD_Methode_C  0.53449    0.04488  11.909  < 2e-16
                                           
(Intercept)                             ***
factor(methods)Proportion_HAD_Methode_B ***
factor(methods)Proportion_HAD_Methode_C ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 18524  on 16499  degrees of freedom
Residual deviance: 18380  on 16497  degrees of freedom
AIC: 18386

Number of Fisher Scoring iterations: 4

Aus den Koeffizienten können wir jetzt die Proportionen berechnen. Für die Methode A, dies ist die Referenzkategorie, nehmen wir einfach den Koeffizienten des Intercepts.

exp(res$coefficients[1])/(1+exp(res$coefficients[1]))

[1] 0.1983636

Hier die Berechnung der Proportion für die Methode B müssen wir den Koeffizienten des Interceptes und den Koeffizienten der Methode B zusammenzählen, bevor wir exponenzieren.

exp(res$coefficients[2]+res$coefficients[1])/(1+exp(res$coefficients[2]+res$coefficients[1]))

[1] 0.252

Und hier die Berechnung der Proportion für die Methode C

exp(res$coefficients[3]+res$coefficients[1])/(1+exp(res$coefficients[3]+res$coefficients[1]))

[1] 0.2969091

Damit die Konfidenzintervalle stimmen, müssen wir ein Modell benutzen, das berücksichtigit, dass wir im Datensatz ja jede Person drei Mal drinhaben - gemessen mit jeder der Methoden.

Zuerst machen wir es einmal falsch - im ersten Versuch nehmen wir die Messwiederholung nicht als zufälligen Faktor mit ins Modell, das heisst, wir haben nur ein random-intercept Modell.

fit.m1<-lme4::glmer(HAD~methods+(1|id), data=data_long, family=binomial)
(res<-summary(fit.m1))

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: HAD ~ methods + (1 | id)
   Data: data_long

     AIC      BIC   logLik deviance df.resid 
  9720.4   9751.2  -4856.2   9712.4    16496 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.6868 -0.0121 -0.0049 -0.0014  5.9438 

Random effects:
 Groups Name        Variance Std.Dev.
 id     (Intercept) 337.3    18.36   
Number of obs: 16500, groups:  id, 5500

Fixed effects:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -13.0944     0.2703  -48.45   <2e-16 ***
methodsProportion_HAD_Methode_B   2.5284     0.1605   15.76   <2e-16 ***
methodsProportion_HAD_Methode_C   4.3293     0.1870   23.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) mP_HAD_M_B
mtP_HAD_M_B -0.607           
mtP_HAD_M_C -0.826  0.690

summarytools::freq(data_long$HAD)

Frequencies  
data_long$HAD  
Type: Numeric  

               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------- --------- -------------- --------- --------------
          0   12390     75.09          75.09     75.09          75.09
          1    4110     24.91         100.00     24.91         100.00
       <NA>       0                               0.00         100.00
      Total   16500    100.00         100.00    100.00         100.00

Proportion von HAD mit der Methode A (Referenz-Kategorie)

exp(res$coefficients[1])/(1+exp(res$coefficients[1]))

[1] 2.056775e-06

Proportion von HAD mit der Methode B

exp(res$coefficients[2]+res$coefficients[1])/(1+exp(res$coefficients[2]+res$coefficients[1]))

[1] 2.577831e-05

Wir sehen, dass die Proportionen nicht stimmen.

Damit die Proportionen stimmen, müssen wir zusätzlich ein random slope Modell berechnen.

control=lme4::glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5))
fit.random_intercept_plus_random_slope<-lme4::glmer(HAD~methods+(1+methods|id), data=data_long, family=binomial, control=control)

Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 3.96442 (tol = 0.002, component 1)

Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?

(res<-summary(fit.random_intercept_plus_random_slope))

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: HAD ~ methods + (1 + methods | id)
   Data: data_long
Control: control

     AIC      BIC   logLik deviance df.resid 
  9666.7   9736.1  -4824.3   9648.7    16491 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-0.69715 -0.13107 -0.00125 -0.00011  1.71521 

Random effects:
 Groups Name                            Variance Std.Dev. Corr       
 id     (Intercept)                     1178.0   34.32               
        methodsProportion_HAD_Methode_B  578.1   24.04     0.16      
        methodsProportion_HAD_Methode_C 1111.5   33.34    -1.00 -0.15
Number of obs: 16500, groups:  id, 5500

Fixed effects:
                                  Estimate Std. Error    z value Pr(>|z|)    
(Intercept)                     -1.112e+01  1.093e-04 -101693.53   <2e-16 ***
methodsProportion_HAD_Methode_B  7.297e+00  2.899e-01      25.17   <2e-16 ***
methodsProportion_HAD_Methode_C  1.001e+01  1.093e-04   91539.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) mP_HAD_M_B
mtP_HAD_M_B 0.000            
mtP_HAD_M_C 0.001  0.000     
optimizer (bobyqa) convergence code: 0 (OK)
Model failed to converge with max|grad| = 3.96442 (tol = 0.002, component 1)
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?

summarytools::freq(data_long$HAD)

Frequencies  
data_long$HAD  
Type: Numeric  

               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------- --------- -------------- --------- --------------
          0   12390     75.09          75.09     75.09          75.09
          1    4110     24.91         100.00     24.91         100.00
       <NA>       0                               0.00         100.00
      Total   16500    100.00         100.00    100.00         100.00

Mit den Standard-Einstellungen konvergiert das Modell nicht. Lösungen hier:https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html.

library(lme4)
ss <- getME(fit.random_intercept_plus_random_slope,c("theta","fixef"))
update <- update(fit.random_intercept_plus_random_slope,start=ss,control=glmerControl(optimizer="bobyqa",
                            optCtrl=list(maxfun=2e5)))

Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.479073 (tol = 0.002, component 1)

Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

(res<-summary(update))

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: HAD ~ methods + (1 + methods | id)
   Data: data_long
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  9613.2   9682.6  -4797.6   9595.2    16491 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-0.68929 -0.14891 -0.00209 -0.00006  1.71039 

Random effects:
 Groups Name                            Variance Std.Dev. Corr       
 id     (Intercept)                     1396.1   37.36               
        methodsProportion_HAD_Methode_B  539.3   23.22    -0.36      
        methodsProportion_HAD_Methode_C 1324.2   36.39    -1.00  0.37
Number of obs: 16500, groups:  id, 5500

Fixed effects:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.209e+01  1.477e-04  -81845   <2e-16 ***
methodsProportion_HAD_Methode_B  7.761e+00  1.477e-04   52535   <2e-16 ***
methodsProportion_HAD_Methode_C  1.098e+01  1.477e-04   74302   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) mP_HAD_M_B
mtP_HAD_M_B 0.000            
mtP_HAD_M_C 0.000  0.000     
optimizer (bobyqa) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.479073 (tol = 0.002, component 1)
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

summarytools::freq(data_long$HAD)

Frequencies  
data_long$HAD  
Type: Numeric  

               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------- --------- -------------- --------- --------------
          0   12390     75.09          75.09     75.09          75.09
          1    4110     24.91         100.00     24.91         100.00
       <NA>       0                               0.00         100.00
      Total   16500    100.00         100.00    100.00         100.00

Immer noch nicht konvergiert, wir müssen jetzt verschiedene Optimizer benutzen.

afurl <- "https://raw.githubusercontent.com/lme4/lme4/master/misc/issues/allFit.R"
eval(parse(text=getURL(afurl)))

Error in parse(text = getURL(afurl)): <text>:1:10: unexpected symbol
1: 404: Not Found
             ^

aa <- allFit(fit.random_intercept_plus_random_slope)

Loading required namespace: dfoptim

Loading required namespace: optimx

bobyqa :

Error in ctrl$optimizer <- optimizer[..i]: object of type 'symbol' is not subsettable

Proportion von HAD mit der Methode A (Referenz-Kategorie)

exp(res$coefficients[1])/(1+exp(res$coefficients[1]))

[1] 5.611819e-06

Proportion von HAD mit der Methode B

exp(res$coefficients[2]+res$coefficients[1])/(1+exp(res$coefficients[2]+res$coefficients[1]))

[1] 0.012996

Proportion von HAD mit der Methode C

exp(res$coefficients[3]+res$coefficients[1])/(1+exp(res$coefficients[3]+res$coefficients[1]))

[1] 0.2470705

Übrigens: Das Paket broom.mixed kann auch hilfreich sein, wenn wir die Modelle in eine Tabelle bringen möchten.

library(broom.mixed)
results<-tidy(fit.random_intercept_plus_random_slope,conf.int=TRUE,exponentiate=TRUE,effects="fixed") %>% 
  mutate(proportion=)

Error in is_data_pronoun(expr): argument "expr" is missing, with no default

Das Berechnen der Proportionen ist ja sehr einfach, die Konfidenzintervalle der Proportionen zu berechnen wäre manuell etwas schwieriger. Deswegen benutzen wir hier folgenden Weg https://stats.stackexchange.com/questions/70817/how-to-calculate-estimated-proportions-and-their-confidence-intervals-from-a-mix.

Wir benutzen das Paket effects siehe hier für mehr Infomration.

est<-Effect(c("methods","repeated_measure"),fit.random_intercept_plus_random_slope)

Error in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the following predictor is not in the model: repeated_measure

data_proportions_ci<-cbind(est$x,est$fit,est$se,est$lower,est$upper)

Error in cbind(est$x, est$fit, est$se, est$lower, est$upper): object 'est' not found

data_proportions_ci<-data_proportions_ci %>% 
  mutate(proportion=exp(est$fit)/(1+exp(est$fit)))

Error in mutate(., proportion = exp(est$fit)/(1 + exp(est$fit))): object 'data_proportions_ci' not found

Die Proportionen stimmen immer noch nicht. ## Stata Siehe mehr dazu hier: https://www.statalist.org/forums/forum/general-stata-discussion/general/1331454-comparing-proportions-between-separate-variables.

*use data_long_proportion.dta
*describe
*encode methods, gen(methods_num)
*melogit HAD i.methods_num || id:
 * matrix list e(b)
*dis exp(_b[_cons] + _b[2.methods_num]) / (1+exp(_b[_cons] + _b[2.methods_num]))
* margins i.methods_num

* use data_long_proportion.dta
* 
* encode methods, gen(methods_num)
* meglm HAD i.methods_num || id:, family(nbinomial) link(log) nolog  intpoints(18)  
* margins i.methods_num