We will work on the property-crime variables, which group together arson, larceny, burglary, auto theft, and theft from a motor vehicle.
For each of them, we will first fit a linear regression and eliminate variables using the AIC criterion to see which ones contribute to the model. We will then test whether the count variable follows a Poisson-type distribution. Next, we will fit a "Lasso" regression with the glmnet package. If the model keeps too many variables, we will test for overdispersion with the dispersiontest function. Finally, we will collect the variables selected by the "Poisson" Lasso regression and by the "Quasi-Poisson" model.
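Since the same pipeline is applied to each crime count below, it may help to see it condensed in one place. This is only an illustrative sketch: the wrapper name run_crime_pipeline is ours, and it assumes the data frame df_survey and the vector of predictor names M used throughout this section.

library(MASS)    # stepAIC
library(glmnet)  # Lasso
library(AER)     # dispersiontest

# Hypothetical wrapper: runs the whole selection pipeline for one crime-count column
run_crime_pipeline <- function(response, predictors = M, data = df_survey) {
  mat <- data[, predictors]
  mat$y <- data[[response]]
  # 1. Linear model, then backward selection by AIC
  aic_fit <- stepAIC(lm(y ~ ., data = mat), direction = "backward", trace = FALSE)
  # 2. Poisson Lasso path on the same predictors
  lasso_fit <- glmnet(x = as.matrix(mat[, predictors]), y = mat$y, family = "poisson")
  # 3. Poisson GLM and overdispersion test
  pois_fit <- glm(y ~ ., data = mat, family = "poisson")
  disp <- dispersiontest(pois_fit, trafo = 1)
  # 4. Quasi-Poisson GLM, used when overdispersion is detected
  qp_fit <- glm(y ~ ., data = mat, family = "quasipoisson")
  list(aic = aic_fit, lasso = lasso_fit, dispersion = disp, quasipoisson = qp_fit)
}
# Example use (hypothetical): res <- run_crime_pipeline("nb_burglary")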
In linear regression, there are two major problems.
Let \(R^2_{j}\) be the coefficient of determination of the regression of \(x_{j}\) on the \((p-1)\) other variables. A collinearity problem may arise: \(R^2_{j} \approx 1\) implies that the variance inflation factor \(v_{j}=\frac{1}{1-R^2_{j}} \approx +\infty\) (for instance, \(R^2_{j}=0.99\) already gives \(v_{j}=100\)).
On the other hand, it is important to look at the RSS (residual sum of squares), an indicator of the quality of the regression, divided by the degrees of freedom. A dimensionality problem may arise: \(p \approx n\) implies \(\hat{\sigma}_{\epsilon}^2 = \frac{\sum_{i=1}^n \hat{\epsilon}^2_{i}}{n-p} \approx +\infty\); likewise, \(p>n\) implies that \((X'X)\) is not invertible.
These problems lead to a high variance of the estimates, i.e. the estimated coefficients are very erratic and overly dependent on the training sample.
We select the following variables:
M  # vector of selected predictor names
## [1] "TWO_OR_MORE" "MALE"
## [3] "AGE_30_TO_39" "AGE_40_TO_49"
## [5] "TTL_AGE_3_PLUS_ENRSTATUS" "NOT_ENROLLED"
## [7] "TTLPOP_25PLUS_EDU" "TTLPOP_5PLUS_LNG"
## [9] "HH_INC_125000_149999" "NATIVE"
In this part, we want to fit a regression and eliminate some variables using the AIC criterion in order to see which ones contribute to the model.
library(MASS)
library(knitr)
library(kableExtra)
Mat_crime <- df_survey[,M]                       # keep only the selected predictors
Mat_crime$nb_burglary <- df_survey$nb_burglary   # add the response: burglary counts
reg <- lm(nb_burglary~., data = Mat_crime)       # full linear model
summary(reg)
##
## Call:
## lm(formula = nb_burglary ~ ., data = Mat_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.439 -71.327 -6.218 68.841 220.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 250.63794 46.94089 5.339 7.4e-06 ***
## TWO_OR_MORE 0.41145 0.14148 2.908 0.00656 **
## MALE 0.26416 0.11846 2.230 0.03290 *
## AGE_30_TO_39 0.09930 0.06004 1.654 0.10794
## AGE_40_TO_49 0.28414 0.08981 3.164 0.00340 **
## TTL_AGE_3_PLUS_ENRSTATUS -0.69417 0.29844 -2.326 0.02651 *
## NOT_ENROLLED 0.21136 0.11468 1.843 0.07460 .
## TTLPOP_25PLUS_EDU -0.18458 0.08965 -2.059 0.04772 *
## TTLPOP_5PLUS_LNG 0.50880 0.30041 1.694 0.10004
## HH_INC_125000_149999 0.07142 0.22695 0.315 0.75505
## NATIVE -0.01223 0.03455 -0.354 0.72568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.3 on 32 degrees of freedom
## Multiple R-squared: 0.8295, Adjusted R-squared: 0.7762
## F-statistic: 15.56 on 10 and 32 DF, p-value: 1.234e-09
regfinale <- stepAIC(reg, direction = "backward", trace = FALSE)        # backward selection by AIC
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])  # keep the terms significant at the 5% level
kable(affichage, col.names = "Variables selected by the AIC criterion") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the AIC criterion |
---|
(Intercept) |
TWO_OR_MORE |
MALE |
AGE_40_TO_49 |
TTL_AGE_3_PLUS_ENRSTATUS |
TTLPOP_25PLUS_EDU |
The variables selected by the AIC criterion are listed in the table above. Note that \(R^2=0.8295\), which is close to 1, so we must watch out for possible collinearity in our data.
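One way to check this is to inspect the variance inflation factors of the model fitted above; a minimal sketch, assuming the car package is installed (it is loaded later as a dependency of AER):

library(car)  # provides vif()
vif(reg)      # values well above 10 point to problematic collinearity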
Let us now test whether the variable follows a Poisson-type distribution.
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(knitr)
n <- sum(df_survey$nb_burglary)                               # total number of burglaries, used as the number of simulated draws
p <- plot_ly(x = ~df_survey$nb_burglary, type = "histogram")  # observed distribution of the counts
p
p <- plot_ly(x = ~rpois(n, lambda = 1), type = "histogram")   # simulated Poisson(1) sample for comparison
p
It appears that the burglary-count variable follows a Poisson-type distribution with a lambda close to 1.
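A quick numerical check of the Poisson assumption is to compare the mean and the variance of the counts, which should be of the same order under a Poisson model; a minimal sketch:

mean(df_survey$nb_burglary)  # empirical mean of the counts
var(df_survey$nb_burglary)   # a variance much larger than the mean already hints at overdispersion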
Next, we fit a "Lasso" regression with the glmnet package.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library(kableExtra)
lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),                               # predictors (column 11 is the response)
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])), # response scaled by its standard deviation
                family = "poisson")
plot(lasso, xvar = "lambda", label = TRUE, main = "Lasso selection of variables")
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
TWO_OR_MORE |
AGE_30_TO_39 |
HH_INC_125000_149999 |
The variables selected by the Lasso regression are shown in the table above.
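Note that the table above reads the coefficients at the 10th value of the lambda path (lasso$beta[,10]), which is an arbitrary point on that path. A more common way to choose the penalty is cross-validation; a minimal sketch on the same design matrix, using the raw counts:

set.seed(1)  # the cross-validation folds are random
cv <- cv.glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11]),
                family = "poisson")
coef(cv, s = "lambda.min")  # coefficients at the lambda minimising the cross-validated deviance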
glm <- glm(nb_burglary~., data = Mat_crime, family = "poisson")
summary(glm)
##
## Call:
## glm(formula = nb_burglary ~ ., family = "poisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.033 -3.116 -0.879 3.613 10.391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.735e+00 1.832e-02 313.002 < 2e-16 ***
## TWO_OR_MORE 5.496e-04 5.034e-05 10.918 < 2e-16 ***
## MALE 5.889e-04 4.573e-05 12.880 < 2e-16 ***
## AGE_30_TO_39 1.508e-04 2.218e-05 6.798 1.06e-11 ***
## AGE_40_TO_49 5.968e-04 3.171e-05 18.817 < 2e-16 ***
## TTL_AGE_3_PLUS_ENRSTATUS -1.584e-03 1.064e-04 -14.884 < 2e-16 ***
## NOT_ENROLLED 2.776e-04 3.802e-05 7.302 2.84e-13 ***
## TTLPOP_25PLUS_EDU -2.130e-04 2.843e-05 -7.494 6.70e-14 ***
## TTLPOP_5PLUS_LNG 1.197e-03 1.088e-04 11.006 < 2e-16 ***
## HH_INC_125000_149999 -3.676e-04 7.424e-05 -4.951 7.37e-07 ***
## NATIVE -2.489e-05 1.155e-05 -2.155 0.0311 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4872.0 on 42 degrees of freedom
## Residual deviance: 1077.5 on 32 degrees of freedom
## AIC: 1442.8
##
## Number of Fisher Scoring iterations: 4
A problem appears: our model does not discard any variable, it keeps them all (every coefficient comes out highly significant). It seems that we are facing overdispersion. We therefore draw on the code from this forum thread https://stats.stackexchange.com/questions/66586/is-there-a-test-to-determine-whether-glm-overdispersion-is-significant to test whether or not overdispersion is present.
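Before the formal test, a quick rule of thumb is to compare the residual deviance with its degrees of freedom: for a well-specified Poisson model the ratio should be close to 1. A minimal sketch (pois_fit is our name for a refit of the model above):

pois_fit <- glm(nb_burglary ~ ., data = Mat_crime, family = "poisson")
pois_fit$deviance / pois_fit$df.residual  # about 1077.5 / 32 ≈ 34 here, far above 1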
library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
fit <- glm(nb_burglary~., data = Mat_crime, family="poisson")
dispersiontest(fit, trafo = 1)
##
## Overdispersion test
##
## data: fit
## z = 4.5648, p-value = 2.5e-06
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 23.97375
According to the test, we are indeed facing overdispersion (with trafo = 1 the alternative hypothesis is \(Var(Y) = \mu + \alpha\mu\), and \(\alpha\) is estimated here at about 24). We therefore replace the Poisson model with a Quasi-Poisson model, which is suited to this kind of data.
library(kableExtra)
fit.surdisp <- glm(nb_burglary~., data = Mat_crime,
family = "quasipoisson")
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
##
## Call:
## glm(formula = nb_burglary ~ ., family = "quasipoisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.033 -3.116 -0.879 3.613 10.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.735e+00 1.061e-01 54.035 < 2e-16 ***
## TWO_OR_MORE 5.496e-04 2.916e-04 1.885 0.06856 .
## MALE 5.889e-04 2.649e-04 2.223 0.03337 *
## AGE_30_TO_39 1.508e-04 1.285e-04 1.174 0.24925
## AGE_40_TO_49 5.968e-04 1.837e-04 3.248 0.00273 **
## TTL_AGE_3_PLUS_ENRSTATUS -1.584e-03 6.163e-04 -2.570 0.01505 *
## NOT_ENROLLED 2.776e-04 2.202e-04 1.261 0.21658
## TTLPOP_25PLUS_EDU -2.130e-04 1.647e-04 -1.294 0.20504
## TTLPOP_5PLUS_LNG 1.197e-03 6.302e-04 1.900 0.06648 .
## HH_INC_125000_149999 -3.676e-04 4.300e-04 -0.855 0.39903
## NATIVE -2.489e-05 6.688e-05 -0.372 0.71227
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 33.55381)
##
## Null deviance: 4872.0 on 42 degrees of freedom
## Residual deviance: 1077.5 on 32 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
kable(affichage, col.names = "Variables selected by the Quasi-Poisson GLM") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Quasi-Poisson GLM |
---|
(Intercept) |
MALE |
AGE_40_TO_49 |
TTL_AGE_3_PLUS_ENRSTATUS |
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
TWO_OR_MORE |
AGE_30_TO_39 |
HH_INC_125000_149999 |
Burglaries in Denver therefore seem to be explained by socio-economic variables such as:

- the total population aged 3 and over enrolled in school (TTL_AGE_3_PLUS_ENRSTATUS);
- household income between 125,000 and 149,999 dollars (HH_INC_125000_149999).
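Besides quasi-Poisson, another standard way to handle overdispersed counts is a negative binomial model; a minimal sketch using glm.nb from the MASS package loaded above (nb_fit is our name):

nb_fit <- glm.nb(nb_burglary ~ ., data = Mat_crime)
summary(nb_fit)  # reports the estimated dispersion parameter theta alongside the coefficients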
#Selection of variables by correlation with the nb_arson variable
library(MASS)
library(knitr)
library(kableExtra)
Mat_crime <- df_survey[,M]
Mat_crime$nb_arson <- df_survey$nb_arson
reg <- lm(nb_arson~., data = Mat_crime)
summary(reg)
##
## Call:
## lm(formula = nb_arson ~ ., data = Mat_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1102 -3.4578 -0.3942 2.4378 11.9601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.690932 2.252036 1.195 0.241192
## HISPANIC_OR_LATINO -0.003993 0.001807 -2.209 0.034659 *
## AGE_10_TO_14 -0.005283 0.009250 -0.571 0.572006
## AGE_10_TO_19 0.003352 0.006802 0.493 0.625585
## LESS_THAN_HS_DIPLOMA_EDU 0.025127 0.005956 4.218 0.000198 ***
## SPANISH_LNG 0.006914 0.005345 1.294 0.205378
## OTHER_FAMILY 0.001943 0.011626 0.167 0.868345
## FEMALE_HHLDR_NO_HSBND_PRESENT -0.009380 0.013984 -0.671 0.507320
## FOREIGN_BORN_FB 0.004320 0.002780 1.554 0.130371
## AMERICAS_FB 0.003904 0.039634 0.098 0.922174
## LATIN_AMERICA_FB -0.031823 0.040711 -0.782 0.440336
## CENTRAL_AMERICA_FB 0.007110 0.015247 0.466 0.644264
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.828 on 31 degrees of freedom
## Multiple R-squared: 0.7787, Adjusted R-squared: 0.7002
## F-statistic: 9.919 on 11 and 31 DF, p-value: 2.206e-07
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
kable(affichage, col.names = "Variables selected by the AIC criterion") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the AIC criterion |
---|
HISPANIC_OR_LATINO |
LESS_THAN_HS_DIPLOMA_EDU |
LATIN_AMERICA_FB |
library(plotly)
library(knitr)
n <- sum(df_survey$nb_arson)
p <- plot_ly(x = ~df_survey$nb_arson,type = "histogram")
p
p <- plot_ly(x = ~rpois(n,lambda = 2), type = "histogram")
p
library(glmnet)
library(kableExtra)
lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = TRUE, main = "Lasso selection of variables")
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
LATIN_AMERICA_FB |
glm <- glm(nb_arson~., data = Mat_crime, family = "poisson")
summary(glm)
##
## Call:
## glm(formula = nb_arson ~ ., family = "poisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1029 -1.1971 -0.1717 0.8896 2.8558
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.749e+00 1.238e-01 14.127 < 2e-16 ***
## HISPANIC_OR_LATINO -1.061e-04 8.568e-05 -1.238 0.2156
## AGE_10_TO_14 -1.473e-04 4.572e-04 -0.322 0.7474
## AGE_10_TO_19 5.358e-05 3.322e-04 0.161 0.8719
## LESS_THAN_HS_DIPLOMA_EDU 1.561e-03 2.420e-04 6.450 1.12e-10 ***
## SPANISH_LNG 1.668e-05 2.505e-04 0.067 0.9469
## OTHER_FAMILY 1.333e-04 6.095e-04 0.219 0.8269
## FEMALE_HHLDR_NO_HSBND_PRESENT -4.285e-04 7.188e-04 -0.596 0.5511
## FOREIGN_BORN_FB 2.685e-04 1.316e-04 2.041 0.0413 *
## AMERICAS_FB 4.448e-04 2.122e-03 0.210 0.8340
## LATIN_AMERICA_FB -1.525e-03 2.212e-03 -0.689 0.4906
## CENTRAL_AMERICA_FB 1.450e-04 6.851e-04 0.212 0.8324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 291.953 on 42 degrees of freedom
## Residual deviance: 93.148 on 31 degrees of freedom
## AIC: 297.15
##
## Number of Fisher Scoring iterations: 5
library(AER)
fit <- glm(nb_arson~., data = Mat_crime, family="poisson")
dispersiontest(fit, trafo = 1)
##
## Overdispersion test
##
## data: fit
## z = 2.9727, p-value = 0.001476
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 1.149056
library(kableExtra)
fit.surdisp <- glm(nb_arson~., data = Mat_crime,
family = "quasipoisson")
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
##
## Call:
## glm(formula = nb_arson ~ ., family = "quasipoisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1029 -1.1971 -0.1717 0.8896 2.8558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.749e+00 2.131e-01 8.205 2.88e-09 ***
## HISPANIC_OR_LATINO -1.061e-04 1.475e-04 -0.719 0.477380
## AGE_10_TO_14 -1.473e-04 7.873e-04 -0.187 0.852841
## AGE_10_TO_19 5.358e-05 5.719e-04 0.094 0.925967
## LESS_THAN_HS_DIPLOMA_EDU 1.561e-03 4.167e-04 3.746 0.000736 ***
## SPANISH_LNG 1.668e-05 4.314e-04 0.039 0.969410
## OTHER_FAMILY 1.333e-04 1.049e-03 0.127 0.899753
## FEMALE_HHLDR_NO_HSBND_PRESENT -4.285e-04 1.238e-03 -0.346 0.731495
## FOREIGN_BORN_FB 2.685e-04 2.266e-04 1.185 0.244914
## AMERICAS_FB 4.448e-04 3.653e-03 0.122 0.903879
## LATIN_AMERICA_FB -1.525e-03 3.808e-03 -0.400 0.691635
## CENTRAL_AMERICA_FB 1.450e-04 1.180e-03 0.123 0.902968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.964458)
##
## Null deviance: 291.953 on 42 degrees of freedom
## Residual deviance: 93.148 on 31 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
kable(affichage, col.names = "Variables selected by the Quasi-Poisson GLM") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Quasi-Poisson GLM |
---|
(Intercept) |
LESS_THAN_HS_DIPLOMA_EDU |
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
LATIN_AMERICA_FB |
#Selection of variables by correlation with the nb_auto-theft variable
library(MASS)
library(knitr)
library(kableExtra)
Mat_crime <- df_survey[,M]
Mat_crime$`nb_auto-theft` <- df_survey$`nb_auto-theft`
reg <- lm(`nb_auto-theft`~., data = Mat_crime)
summary(reg)
##
## Call:
## lm(formula = `nb_auto-theft` ~ ., data = Mat_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -278.77 -89.60 -12.02 91.64 288.62
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.22930 55.26489 2.845 0.00768 **
## TWO_OR_MORE 0.13533 0.15772 0.858 0.39724
## MALE 0.04530 0.52047 0.087 0.93118
## FEMALE -0.26984 0.52522 -0.514 0.61095
## AGE_LESS_5 0.26165 0.36900 0.709 0.48341
## AGE_30_TO_39 0.16505 0.06955 2.373 0.02380 *
## TTL_AGE_3_PLUS_ENRSTATUS -0.02396 0.48967 -0.049 0.96127
## NOT_ENROLLED 0.35966 0.13812 2.604 0.01386 *
## TTLPOP_25PLUS_EDU -0.24051 0.10199 -2.358 0.02464 *
## TTLPOP_5PLUS_LNG NA NA NA NA
## OTHER_FAMILY 0.23443 0.29791 0.787 0.43712
## FEMALE_HHLDR_NO_HSBND_PRESENT 0.22556 0.37381 0.603 0.55049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.8 on 32 degrees of freedom
## Multiple R-squared: 0.8093, Adjusted R-squared: 0.7497
## F-statistic: 13.58 on 10 and 32 DF, p-value: 6.742e-09
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
kable(affichage, col.names = "Variables selected by the AIC criterion") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the AIC criterion |
---|
(Intercept) |
FEMALE |
AGE_LESS_5 |
AGE_30_TO_39 |
NOT_ENROLLED |
TTLPOP_25PLUS_EDU |
OTHER_FAMILY |
library(plotly)
library(knitr)
n <- sum(df_survey$`nb_auto-theft`)
p <- plot_ly(x = ~df_survey$`nb_auto-theft`,type = "histogram")
p
p <- plot_ly(x = ~rpois(n,lambda = 3), type = "histogram")
p
library(glmnet)
library(kableExtra)
lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = TRUE, main = "Lasso selection of variables")
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
OTHER_FAMILY |
glm <- glm(`nb_auto-theft`~., data = Mat_crime, family = "poisson")
summary(glm)
##
## Call:
## glm(formula = `nb_auto-theft` ~ ., family = "poisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.1645 -3.6706 -0.9488 3.5901 13.9177
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.523e+00 1.912e-02 288.780 < 2e-16 ***
## TWO_OR_MORE -4.587e-05 4.348e-05 -1.055 0.2915
## MALE -3.425e-04 1.642e-04 -2.087 0.0369 *
## FEMALE -9.360e-04 1.616e-04 -5.792 6.97e-09 ***
## AGE_LESS_5 8.691e-04 1.114e-04 7.803 6.04e-15 ***
## AGE_30_TO_39 3.009e-04 2.234e-05 13.472 < 2e-16 ***
## TTL_AGE_3_PLUS_ENRSTATUS 3.186e-04 1.517e-04 2.100 0.0357 *
## NOT_ENROLLED 6.326e-04 4.125e-05 15.336 < 2e-16 ***
## TTLPOP_25PLUS_EDU -3.403e-04 2.992e-05 -11.373 < 2e-16 ***
## TTLPOP_5PLUS_LNG NA NA NA NA
## OTHER_FAMILY 8.465e-04 9.599e-05 8.818 < 2e-16 ***
## FEMALE_HHLDR_NO_HSBND_PRESENT 1.173e-04 1.173e-04 1.000 0.3172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5975.0 on 42 degrees of freedom
## Residual deviance: 1485.1 on 32 degrees of freedom
## AIC: 1849.8
##
## Number of Fisher Scoring iterations: 4
library(AER)
fit <- glm(`nb_auto-theft`~., data = Mat_crime, family="poisson")
dispersiontest(fit, trafo = 1)
##
## Overdispersion test
##
## data: fit
## z = 4.1125, p-value = 1.957e-05
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 32.63091
library(kableExtra)
fit.surdisp <- glm(`nb_auto-theft`~., data = Mat_crime,
family = "quasipoisson")
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
##
## Call:
## glm(formula = `nb_auto-theft` ~ ., family = "quasipoisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.1645 -3.6706 -0.9488 3.5901 13.9177
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.523e+00 1.286e-01 42.960 <2e-16 ***
## TWO_OR_MORE -4.587e-05 2.923e-04 -0.157 0.8763
## MALE -3.425e-04 1.103e-03 -0.310 0.7583
## FEMALE -9.360e-04 1.086e-03 -0.862 0.3953
## AGE_LESS_5 8.691e-04 7.487e-04 1.161 0.2543
## AGE_30_TO_39 3.009e-04 1.502e-04 2.004 0.0536 .
## TTL_AGE_3_PLUS_ENRSTATUS 3.186e-04 1.019e-03 0.312 0.7567
## NOT_ENROLLED 6.326e-04 2.773e-04 2.281 0.0293 *
## TTLPOP_25PLUS_EDU -3.403e-04 2.011e-04 -1.692 0.1004
## TTLPOP_5PLUS_LNG NA NA NA NA
## OTHER_FAMILY 8.465e-04 6.453e-04 1.312 0.1989
## FEMALE_HHLDR_NO_HSBND_PRESENT 1.173e-04 7.885e-04 0.149 0.8826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 45.18634)
##
## Null deviance: 5975.0 on 42 degrees of freedom
## Residual deviance: 1485.1 on 32 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
kable(affichage, col.names = "Variables selected by the Quasi-Poisson GLM") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Quasi-Poisson GLM |
---|
(Intercept) |
NOT_ENROLLED |
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
OTHER_FAMILY |
#Selection of variables by correlation with the nb_theft-from-motor-vehicle variable
library(MASS)
library(knitr)
library(kableExtra)
Mat_crime <- df_survey[,M]
Mat_crime$`nb_theft-from-motor-vehicle` <- df_survey$`nb_theft-from-motor-vehicle`
reg <- lm(`nb_theft-from-motor-vehicle`~., data = Mat_crime)
summary(reg)
##
## Call:
## lm(formula = `nb_theft-from-motor-vehicle` ~ ., data = Mat_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -343.55 -152.05 17.75 155.09 415.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 239.948159 85.951583 2.792 0.00890 **
## TWO_OR_MORE 0.834893 0.267186 3.125 0.00384 **
## MALE 0.005720 0.153673 0.037 0.97055
## AGE_20_TO_29 0.007201 0.093493 0.077 0.93910
## AGE_30_TO_39 0.009423 0.109481 0.086 0.93197
## NOT_ENROLLED 0.356429 0.231742 1.538 0.13418
## COMMUTE_LESS_15 0.292148 0.207469 1.408 0.16904
## TTLPOP_25PLUS_EDU -0.474503 0.234439 -2.024 0.05166 .
## ONLY_ENGLISH_LNG -0.044782 0.050868 -0.880 0.38544
## TTL_HOUSING_UNITS 0.139664 0.097156 1.438 0.16059
## HH_INC_125000_149999 1.748683 0.515635 3.391 0.00191 **
## BUILT_2000_2009 0.007187 0.070519 0.102 0.91948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 223.5 on 31 degrees of freedom
## Multiple R-squared: 0.8694, Adjusted R-squared: 0.823
## F-statistic: 18.76 on 11 and 31 DF, p-value: 9.885e-11
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
kable(affichage, col.names = "Variables selected by the AIC criterion") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the AIC criterion |
---|
(Intercept) |
TWO_OR_MORE |
NOT_ENROLLED |
COMMUTE_LESS_15 |
TTLPOP_25PLUS_EDU |
TTL_HOUSING_UNITS |
HH_INC_125000_149999 |
library(plotly)
library(knitr)
n <- sum(df_survey$`nb_theft-from-motor-vehicle`)
p <- plot_ly(x = ~df_survey$`nb_theft-from-motor-vehicle`,type = "histogram")
p
p <- plot_ly(x = ~rpois(n,lambda = 1.6), type = "histogram")
p
library(glmnet)
library(kableExtra)
lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = TRUE, main = "Lasso selection of variables")
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
AGE_30_TO_39 |
HH_INC_125000_149999 |
glm <- glm(`nb_theft-from-motor-vehicle`~., data = Mat_crime, family = "poisson")
summary(glm)
##
## Call:
## glm(formula = `nb_theft-from-motor-vehicle` ~ ., family = "poisson",
## data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -15.3857 -4.5671 0.0845 3.3067 11.4538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.912e+00 1.526e-02 387.394 < 2e-16 ***
## TWO_OR_MORE 4.333e-04 4.063e-05 10.666 < 2e-16 ***
## MALE 1.426e-04 2.495e-05 5.717 1.08e-08 ***
## AGE_20_TO_29 -1.702e-04 1.402e-05 -12.147 < 2e-16 ***
## AGE_30_TO_39 7.640e-05 1.703e-05 4.485 7.28e-06 ***
## NOT_ENROLLED 2.920e-04 3.728e-05 7.832 4.81e-15 ***
## COMMUTE_LESS_15 4.215e-04 3.460e-05 12.183 < 2e-16 ***
## TTLPOP_25PLUS_EDU -5.498e-04 3.885e-05 -14.151 < 2e-16 ***
## ONLY_ENGLISH_LNG -8.761e-06 9.041e-06 -0.969 0.333
## TTL_HOUSING_UNITS 2.438e-04 1.554e-05 15.692 < 2e-16 ***
## HH_INC_125000_149999 1.308e-03 8.574e-05 15.253 < 2e-16 ***
## BUILT_2000_2009 -4.229e-05 1.056e-05 -4.004 6.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 11489.5 on 42 degrees of freedom
## Residual deviance: 1775.6 on 31 degrees of freedom
## AIC: 2158.9
##
## Number of Fisher Scoring iterations: 4
library(AER)
fit <- glm(`nb_theft-from-motor-vehicle`~., data = Mat_crime, family="poisson")
dispersiontest(fit, trafo = 1)
##
## Overdispersion test
##
## data: fit
## z = 5.2666, p-value = 6.948e-08
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 39.30164
library(kableExtra)
fit.surdisp <- glm(`nb_theft-from-motor-vehicle`~., data = Mat_crime,
family = "quasipoisson")
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
##
## Call:
## glm(formula = `nb_theft-from-motor-vehicle` ~ ., family = "quasipoisson",
## data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -15.3857 -4.5671 0.0845 3.3067 11.4538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.912e+00 1.141e-01 51.813 <2e-16 ***
## TWO_OR_MORE 4.333e-04 3.038e-04 1.427 0.1637
## MALE 1.426e-04 1.865e-04 0.765 0.4503
## AGE_20_TO_29 -1.702e-04 1.048e-04 -1.625 0.1143
## AGE_30_TO_39 7.640e-05 1.274e-04 0.600 0.5529
## NOT_ENROLLED 2.920e-04 2.787e-04 1.047 0.3030
## COMMUTE_LESS_15 4.215e-04 2.587e-04 1.629 0.1133
## TTLPOP_25PLUS_EDU -5.498e-04 2.905e-04 -1.893 0.0678 .
## ONLY_ENGLISH_LNG -8.761e-06 6.760e-05 -0.130 0.8977
## TTL_HOUSING_UNITS 2.438e-04 1.162e-04 2.099 0.0441 *
## HH_INC_125000_149999 1.308e-03 6.410e-04 2.040 0.0499 *
## BUILT_2000_2009 -4.229e-05 7.897e-05 -0.535 0.5961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 55.90103)
##
## Null deviance: 11489.5 on 42 degrees of freedom
## Residual deviance: 1775.6 on 31 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
kable(affichage, col.names = "Variables selected by the Quasi-Poisson GLM") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Quasi-Poisson GLM |
---|
(Intercept) |
TTL_HOUSING_UNITS |
HH_INC_125000_149999 |
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
AGE_30_TO_39 |
HH_INC_125000_149999 |
#Selection of variables by correlation with the nb_larceny variable
library(MASS)
library(knitr)
library(kableExtra)
Mat_crime <- df_survey[,M]
Mat_crime$nb_larceny <- df_survey$nb_larceny
reg <- lm(nb_larceny~., data = Mat_crime)
summary(reg)
##
## Call:
## lm(formula = nb_larceny ~ ., data = Mat_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1363.03 -281.68 45.14 353.54 1271.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -250.40330 197.70486 -1.267 0.21475
## WHITE 0.03021 0.12830 0.235 0.81537
## AGE_30_TO_39 0.01165 0.17209 0.068 0.94648
## GRADUATE_SCHOOL 2.06803 1.47863 1.399 0.17185
## BACHELORS_OR_HIGHER_EDU -0.17379 0.40439 -0.430 0.67035
## RENTER_OCCUPIED_HU 0.41564 0.30155 1.378 0.17796
## NONFAMILY_HOUSEHOLD 0.08172 0.32963 0.248 0.80583
## HH_INC_125000_149999 -0.51607 1.16210 -0.444 0.66007
## HH_INC_150000_199999 -3.30758 1.60624 -2.059 0.04796 *
## HH_INC_OVER_200000 3.04069 1.07471 2.829 0.00811 **
## BUILT_2014_OR_LATER 0.73084 1.27046 0.575 0.56927
## BUILT_2010_2013 1.45262 0.87755 1.655 0.10795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 580.6 on 31 degrees of freedom
## Multiple R-squared: 0.8464, Adjusted R-squared: 0.792
## F-statistic: 15.54 on 11 and 31 DF, p-value: 1.084e-09
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
kable(affichage, col.names = "Variables selected by the AIC criterion") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the AIC criterion |
---|
RENTER_OCCUPIED_HU |
HH_INC_150000_199999 |
HH_INC_OVER_200000 |
BUILT_2010_2013 |
library(plotly)
library(knitr)
n <- sum(df_survey$nb_larceny)
p <- plot_ly(x = ~df_survey$nb_larceny,type = "histogram")
p
p <- plot_ly(x = ~rpois(n,lambda = 0.5), type = "histogram")
p
library(glmnet)
library(kableExtra)
lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = TRUE, main = "Lasso selection of variables")
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
HH_INC_125000_149999 |
BUILT_2014_OR_LATER |
glm <- glm(nb_larceny~., data = Mat_crime, family = "poisson")
summary(glm)
##
## Call:
## glm(formula = nb_larceny ~ ., family = "poisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.336 -5.919 -2.386 4.258 29.577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.738e+00 1.389e-02 413.196 < 2e-16 ***
## WHITE 4.156e-06 8.918e-06 0.466 0.64120
## AGE_30_TO_39 6.950e-05 9.014e-06 7.711 1.25e-14 ***
## GRADUATE_SCHOOL 1.101e-03 9.328e-05 11.798 < 2e-16 ***
## BACHELORS_OR_HIGHER_EDU -3.947e-04 2.795e-05 -14.120 < 2e-16 ***
## RENTER_OCCUPIED_HU 3.299e-04 1.701e-05 19.395 < 2e-16 ***
## NONFAMILY_HOUSEHOLD 1.912e-04 1.873e-05 10.207 < 2e-16 ***
## HH_INC_125000_149999 -2.578e-04 6.217e-05 -4.146 3.38e-05 ***
## HH_INC_150000_199999 -2.608e-04 9.636e-05 -2.707 0.00679 **
## HH_INC_OVER_200000 2.290e-03 6.668e-05 34.344 < 2e-16 ***
## BUILT_2014_OR_LATER 2.459e-04 7.779e-05 3.162 0.00157 **
## BUILT_2010_2013 4.837e-04 4.951e-05 9.770 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 40709.2 on 42 degrees of freedom
## Residual deviance: 4141.9 on 31 degrees of freedom
## AIC: 4525.9
##
## Number of Fisher Scoring iterations: 5
library(AER)
fit <- glm(nb_larceny~., data = Mat_crime, family="poisson")
dispersiontest(fit, trafo = 1)
##
## Overdispersion test
##
## data: fit
## z = 2.9528, p-value = 0.001575
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 102.2864
library(kableExtra)
fit.surdisp <- glm(nb_larceny~., data = Mat_crime,
family = "quasipoisson")
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
##
## Call:
## glm(formula = nb_larceny ~ ., family = "quasipoisson", data = Mat_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.336 -5.919 -2.386 4.258 29.577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.738e+00 1.662e-01 34.519 < 2e-16 ***
## WHITE 4.156e-06 1.067e-04 0.039 0.96919
## AGE_30_TO_39 6.950e-05 1.079e-04 0.644 0.52421
## GRADUATE_SCHOOL 1.101e-03 1.117e-03 0.986 0.33197
## BACHELORS_OR_HIGHER_EDU -3.947e-04 3.346e-04 -1.180 0.24714
## RENTER_OCCUPIED_HU 3.299e-04 2.036e-04 1.620 0.11530
## NONFAMILY_HOUSEHOLD 1.912e-04 2.242e-04 0.853 0.40038
## HH_INC_125000_149999 -2.578e-04 7.442e-04 -0.346 0.73141
## HH_INC_150000_199999 -2.608e-04 1.153e-03 -0.226 0.82257
## HH_INC_OVER_200000 2.290e-03 7.981e-04 2.869 0.00735 **
## BUILT_2014_OR_LATER 2.459e-04 9.311e-04 0.264 0.79342
## BUILT_2010_2013 4.837e-04 5.927e-04 0.816 0.42064
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 143.2854)
##
## Null deviance: 40709.2 on 42 degrees of freedom
## Residual deviance: 4141.9 on 31 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
kable(affichage, col.names = "Variables selected by the Quasi-Poisson GLM") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Quasi-Poisson GLM |
---|
(Intercept) |
HH_INC_OVER_200000 |
kable(names(which(lasso$beta[,10] != 0)), col.names = "Variables selected by the Lasso regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))%>%
scroll_box(width = "100%")
Variables selected by the Lasso regression |
---|
HH_INC_125000_149999 |
BUILT_2014_OR_LATER |