Nous allons travailler sur la variable crimes contre la propriété regroupant les incendies volontaires, larcins, cambriolages, vols de voitures ainsi que les vols dans un véhicule à moteur.

Pour chacune, nous allons dans un premier temps réaliser une régression linéaire et éliminer certaines variables par critère AIC pour voir lesquelles contribuent au modèle. Puis, dans un second temps tester si chaque variable a une répartition de type Poisson. Ensuite, nous réaliserons une régression de type “Lasso” via le package glmnet. Enfin, si le modèle sélectionne trop de variables nous testerons si nous sommes en présence de surdispersion via la fonction dispersiontest. Enfin, nous récupérerons les variables séléctionnées par la régression Lasso “Poisson” et “Quasi-Poisson”.

Cambriolage

En régression linéaire il existe deux grands problèmes.

Soit \(R^2_{j}\) est le coefficient de détermination de la régression de \(x_{j}\) sur les (p-1) autres variables. Il peut exister un problème de colinéarité : \(R^2_{j} \approx 1\) implique que le facteur d’inflation de la variance \(v_{j}=\frac{1}{1-R^2} \approx +\infty\).

D’autre part, il est important de s’intéresser au SCR (somme des carrés des résidus) : indicateur de qualité de la régression, divisé par les degrés de liberté. Il peut exister un problème de dimensionnalité : \(p \approx n\) implique \(\hat{\sigma}_{\epsilon}^2 = \frac{\sum_{i=1}^n \hat{\epsilon}^2_{i}}{n-p} \approx +\infty\) ainsi, \((p>n)\) implique \((X'X)\) n’est pas inversible.

Ces problèmes entraînent une variance élevée de l’estimation c.-à-d. les coefficients estimés sont très erratiques, exagérèment dépendants de l’échantillon d’apprentissage.

Selection des 11 variables les plus corrélées avec la variable nb_burglary

Nous sélectionnons les variables suivantes :

M
##  [1] "TWO_OR_MORE"              "MALE"                    
##  [3] "AGE_30_TO_39"             "AGE_40_TO_49"            
##  [5] "TTL_AGE_3_PLUS_ENRSTATUS" "NOT_ENROLLED"            
##  [7] "TTLPOP_25PLUS_EDU"        "TTLPOP_5PLUS_LNG"        
##  [9] "HH_INC_125000_149999"     "NATIVE"

Nous sélectionnons les variables suivantes :

Nous voulons dans cette partie effectuer une régression et éliminer certaines variables par critère AIC pour voir lesquelles contribuent au modèle.

library(MASS)
library(knitr)
library(kableExtra)
Mat_crime <- df_survey[,M]
Mat_crime$nb_burglary <- df_survey$nb_burglary
reg <- lm(nb_burglary~., data = Mat_crime)
summary(reg)
## 
## Call:
## lm(formula = nb_burglary ~ ., data = Mat_crime)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -290.439  -71.327   -6.218   68.841  220.751 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              250.63794   46.94089   5.339  7.4e-06 ***
## TWO_OR_MORE                0.41145    0.14148   2.908  0.00656 ** 
## MALE                       0.26416    0.11846   2.230  0.03290 *  
## AGE_30_TO_39               0.09930    0.06004   1.654  0.10794    
## AGE_40_TO_49               0.28414    0.08981   3.164  0.00340 ** 
## TTL_AGE_3_PLUS_ENRSTATUS  -0.69417    0.29844  -2.326  0.02651 *  
## NOT_ENROLLED               0.21136    0.11468   1.843  0.07460 .  
## TTLPOP_25PLUS_EDU         -0.18458    0.08965  -2.059  0.04772 *  
## TTLPOP_5PLUS_LNG           0.50880    0.30041   1.694  0.10004    
## HH_INC_125000_149999       0.07142    0.22695   0.315  0.75505    
## NATIVE                    -0.01223    0.03455  -0.354  0.72568    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.3 on 32 degrees of freedom
## Multiple R-squared:  0.8295, Adjusted R-squared:  0.7762 
## F-statistic: 15.56 on 10 and 32 DF,  p-value: 1.234e-09
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])

kable(affichage,col.names = "Variables sélectionnées par critère AIC")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par critère AIC
(Intercept)
TWO_OR_MORE
MALE
AGE_40_TO_49
TTL_AGE_3_PLUS_ENRSTATUS
TTLPOP_25PLUS_EDU

Ici, le \(R^2=0.8295\) ce qui est proche de 1. Ainsi, nous devons faire attention à une possible colinéarité dans nos données.

Les variables sélectionnées par critère AIC sont :

Testons si chaque variable a une répartition de type Poisson.

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(knitr)
n <- sum(df_survey$nb_burglary)
p <- plot_ly(x = ~df_survey$nb_burglary,type = "histogram") 
p
p <- plot_ly(x = ~rpois(n,lambda = 1), type = "histogram")
p

Il semblerait que la variable nombre de cambriolages ait une répartition de type poisson avec un lambda proche de 1.

Ensuite, nous réalisons une régression de type “Lasso” via le package glmnet.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library(kableExtra)

lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = T, main = "Sélection Lasso des variables")

kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
TWO_OR_MORE
AGE_30_TO_39
HH_INC_125000_149999

Les variables sélectionnées par la régression Lasso sont :

glm <- glm(nb_burglary~., data = Mat_crime, family = "poisson")
summary(glm)
## 
## Call:
## glm(formula = nb_burglary ~ ., family = "poisson", data = Mat_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -13.033   -3.116   -0.879    3.613   10.391  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.735e+00  1.832e-02 313.002  < 2e-16 ***
## TWO_OR_MORE               5.496e-04  5.034e-05  10.918  < 2e-16 ***
## MALE                      5.889e-04  4.573e-05  12.880  < 2e-16 ***
## AGE_30_TO_39              1.508e-04  2.218e-05   6.798 1.06e-11 ***
## AGE_40_TO_49              5.968e-04  3.171e-05  18.817  < 2e-16 ***
## TTL_AGE_3_PLUS_ENRSTATUS -1.584e-03  1.064e-04 -14.884  < 2e-16 ***
## NOT_ENROLLED              2.776e-04  3.802e-05   7.302 2.84e-13 ***
## TTLPOP_25PLUS_EDU        -2.130e-04  2.843e-05  -7.494 6.70e-14 ***
## TTLPOP_5PLUS_LNG          1.197e-03  1.088e-04  11.006  < 2e-16 ***
## HH_INC_125000_149999     -3.676e-04  7.424e-05  -4.951 7.37e-07 ***
## NATIVE                   -2.489e-05  1.155e-05  -2.155   0.0311 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4872.0  on 42  degrees of freedom
## Residual deviance: 1077.5  on 32  degrees of freedom
## AIC: 1442.8
## 
## Number of Fisher Scoring iterations: 4

Un problème apparaît, notre modèle ne sélectionne pas de variables, il les garde toutes. Il semblerait que nous soyons en présence de surdispersion. Nous nous inspirons donc des codes de ce forum https://stats.stackexchange.com/questions/66586/is-there-a-test-to-determine-whether-glm-overdispersion-is-significant pour tester s’il y a présence ou non de surdispersion.

library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
fit <- glm(nb_burglary~., data = Mat_crime, family="poisson") 
dispersiontest(fit, trafo = 1)
## 
##  Overdispersion test
## 
## data:  fit
## z = 4.5648, p-value = 2.5e-06
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 23.97375

D’après le test nous nous trouvons bien en situation de surdispersion. Nous changeons donc notre modèle Poisson pour le modèle Quasi-Poisson qui est adapté pour ce type de modèle.

library(kableExtra)

fit.surdisp <- glm(nb_burglary~., data = Mat_crime,
                   family = "quasipoisson") 
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
## 
## Call:
## glm(formula = nb_burglary ~ ., family = "quasipoisson", data = Mat_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -13.033   -3.116   -0.879    3.613   10.391  
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5.735e+00  1.061e-01  54.035  < 2e-16 ***
## TWO_OR_MORE               5.496e-04  2.916e-04   1.885  0.06856 .  
## MALE                      5.889e-04  2.649e-04   2.223  0.03337 *  
## AGE_30_TO_39              1.508e-04  1.285e-04   1.174  0.24925    
## AGE_40_TO_49              5.968e-04  1.837e-04   3.248  0.00273 ** 
## TTL_AGE_3_PLUS_ENRSTATUS -1.584e-03  6.163e-04  -2.570  0.01505 *  
## NOT_ENROLLED              2.776e-04  2.202e-04   1.261  0.21658    
## TTLPOP_25PLUS_EDU        -2.130e-04  1.647e-04  -1.294  0.20504    
## TTLPOP_5PLUS_LNG          1.197e-03  6.302e-04   1.900  0.06648 .  
## HH_INC_125000_149999     -3.676e-04  4.300e-04  -0.855  0.39903    
## NATIVE                   -2.489e-05  6.688e-05  -0.372  0.71227    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 33.55381)
## 
##     Null deviance: 4872.0  on 42  degrees of freedom
## Residual deviance: 1077.5  on 32  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
kable(affichage,col.names = "Variables sélectionnées par glm Quasi-Poisson")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par glm Quasi-Poisson
(Intercept)
MALE
AGE_40_TO_49
TTL_AGE_3_PLUS_ENRSTATUS
kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
TWO_OR_MORE
AGE_30_TO_39
HH_INC_125000_149999

Les cambriolages à Denver semblent être expliqués par des variables socio-économiques comme :

Incendie volontaire

#Sélection de variables par corrélation avec la variable nb_arson

library(MASS)
library(knitr)
library(kableExtra)

  

Mat_crime <- df_survey[,M]
Mat_crime$nb_arson <- df_survey$nb_arson

reg <- lm(nb_arson~., data = Mat_crime)
summary(reg)
## 
## Call:
## lm(formula = nb_arson ~ ., data = Mat_crime)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1102  -3.4578  -0.3942   2.4378  11.9601 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.690932   2.252036   1.195 0.241192    
## HISPANIC_OR_LATINO            -0.003993   0.001807  -2.209 0.034659 *  
## AGE_10_TO_14                  -0.005283   0.009250  -0.571 0.572006    
## AGE_10_TO_19                   0.003352   0.006802   0.493 0.625585    
## LESS_THAN_HS_DIPLOMA_EDU       0.025127   0.005956   4.218 0.000198 ***
## SPANISH_LNG                    0.006914   0.005345   1.294 0.205378    
## OTHER_FAMILY                   0.001943   0.011626   0.167 0.868345    
## FEMALE_HHLDR_NO_HSBND_PRESENT -0.009380   0.013984  -0.671 0.507320    
## FOREIGN_BORN_FB                0.004320   0.002780   1.554 0.130371    
## AMERICAS_FB                    0.003904   0.039634   0.098 0.922174    
## LATIN_AMERICA_FB              -0.031823   0.040711  -0.782 0.440336    
## CENTRAL_AMERICA_FB             0.007110   0.015247   0.466 0.644264    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.828 on 31 degrees of freedom
## Multiple R-squared:  0.7787, Adjusted R-squared:  0.7002 
## F-statistic: 9.919 on 11 and 31 DF,  p-value: 2.206e-07
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])

kable(affichage,col.names = "  Variables sélectionnées par critère AIC")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par critère AIC
HISPANIC_OR_LATINO
LESS_THAN_HS_DIPLOMA_EDU
LATIN_AMERICA_FB
library(plotly)
library(knitr)
n <- sum(df_survey$nb_arson)
p <- plot_ly(x = ~df_survey$nb_arson,type = "histogram") 
p
p <- plot_ly(x = ~rpois(n,lambda = 2), type = "histogram")
p
library(glmnet)
library(kableExtra)

lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = T, main = "Sélection Lasso des variables")

kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
LATIN_AMERICA_FB
glm <- glm(nb_arson~., data = Mat_crime, family = "poisson")
summary(glm)
## 
## Call:
## glm(formula = nb_arson ~ ., family = "poisson", data = Mat_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1029  -1.1971  -0.1717   0.8896   2.8558  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    1.749e+00  1.238e-01  14.127  < 2e-16 ***
## HISPANIC_OR_LATINO            -1.061e-04  8.568e-05  -1.238   0.2156    
## AGE_10_TO_14                  -1.473e-04  4.572e-04  -0.322   0.7474    
## AGE_10_TO_19                   5.358e-05  3.322e-04   0.161   0.8719    
## LESS_THAN_HS_DIPLOMA_EDU       1.561e-03  2.420e-04   6.450 1.12e-10 ***
## SPANISH_LNG                    1.668e-05  2.505e-04   0.067   0.9469    
## OTHER_FAMILY                   1.333e-04  6.095e-04   0.219   0.8269    
## FEMALE_HHLDR_NO_HSBND_PRESENT -4.285e-04  7.188e-04  -0.596   0.5511    
## FOREIGN_BORN_FB                2.685e-04  1.316e-04   2.041   0.0413 *  
## AMERICAS_FB                    4.448e-04  2.122e-03   0.210   0.8340    
## LATIN_AMERICA_FB              -1.525e-03  2.212e-03  -0.689   0.4906    
## CENTRAL_AMERICA_FB             1.450e-04  6.851e-04   0.212   0.8324    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 291.953  on 42  degrees of freedom
## Residual deviance:  93.148  on 31  degrees of freedom
## AIC: 297.15
## 
## Number of Fisher Scoring iterations: 5
library(AER)
fit <- glm(nb_arson~., data = Mat_crime, family="poisson") 
dispersiontest(fit, trafo = 1)
## 
##  Overdispersion test
## 
## data:  fit
## z = 2.9727, p-value = 0.001476
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 1.149056
library(kableExtra)

fit.surdisp <- glm(nb_arson~., data = Mat_crime,
                   family = "quasipoisson") 
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
## 
## Call:
## glm(formula = nb_arson ~ ., family = "quasipoisson", data = Mat_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1029  -1.1971  -0.1717   0.8896   2.8558  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.749e+00  2.131e-01   8.205 2.88e-09 ***
## HISPANIC_OR_LATINO            -1.061e-04  1.475e-04  -0.719 0.477380    
## AGE_10_TO_14                  -1.473e-04  7.873e-04  -0.187 0.852841    
## AGE_10_TO_19                   5.358e-05  5.719e-04   0.094 0.925967    
## LESS_THAN_HS_DIPLOMA_EDU       1.561e-03  4.167e-04   3.746 0.000736 ***
## SPANISH_LNG                    1.668e-05  4.314e-04   0.039 0.969410    
## OTHER_FAMILY                   1.333e-04  1.049e-03   0.127 0.899753    
## FEMALE_HHLDR_NO_HSBND_PRESENT -4.285e-04  1.238e-03  -0.346 0.731495    
## FOREIGN_BORN_FB                2.685e-04  2.266e-04   1.185 0.244914    
## AMERICAS_FB                    4.448e-04  3.653e-03   0.122 0.903879    
## LATIN_AMERICA_FB              -1.525e-03  3.808e-03  -0.400 0.691635    
## CENTRAL_AMERICA_FB             1.450e-04  1.180e-03   0.123 0.902968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.964458)
## 
##     Null deviance: 291.953  on 42  degrees of freedom
## Residual deviance:  93.148  on 31  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
kable(affichage,col.names = "Variables sélectionnées par glm Quasi-Poisson")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par glm Quasi-Poisson
(Intercept)
LESS_THAN_HS_DIPLOMA_EDU
kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
LATIN_AMERICA_FB

Vol d’automobile

#Sélection de variables par corrélation avec la variable nb_auto-theft

library(MASS)
library(knitr)
library(kableExtra)

  

Mat_crime <- df_survey[,M]
Mat_crime$`nb_auto-theft` <- df_survey$`nb_auto-theft`

reg <- lm(`nb_auto-theft`~., data = Mat_crime)
summary(reg)
## 
## Call:
## lm(formula = `nb_auto-theft` ~ ., data = Mat_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -278.77  -89.60  -12.02   91.64  288.62 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                   157.22930   55.26489   2.845  0.00768 **
## TWO_OR_MORE                     0.13533    0.15772   0.858  0.39724   
## MALE                            0.04530    0.52047   0.087  0.93118   
## FEMALE                         -0.26984    0.52522  -0.514  0.61095   
## AGE_LESS_5                      0.26165    0.36900   0.709  0.48341   
## AGE_30_TO_39                    0.16505    0.06955   2.373  0.02380 * 
## TTL_AGE_3_PLUS_ENRSTATUS       -0.02396    0.48967  -0.049  0.96127   
## NOT_ENROLLED                    0.35966    0.13812   2.604  0.01386 * 
## TTLPOP_25PLUS_EDU              -0.24051    0.10199  -2.358  0.02464 * 
## TTLPOP_5PLUS_LNG                     NA         NA      NA       NA   
## OTHER_FAMILY                    0.23443    0.29791   0.787  0.43712   
## FEMALE_HHLDR_NO_HSBND_PRESENT   0.22556    0.37381   0.603  0.55049   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.8 on 32 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.7497 
## F-statistic: 13.58 on 10 and 32 DF,  p-value: 6.742e-09
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])

kable(affichage,col.names = "  Variables sélectionnées par critère AIC")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par critère AIC
(Intercept)
FEMALE
AGE_LESS_5
AGE_30_TO_39
NOT_ENROLLED
TTLPOP_25PLUS_EDU
OTHER_FAMILY
library(plotly)
library(knitr)
n <- sum(df_survey$`nb_auto-theft`)
p <- plot_ly(x = ~df_survey$`nb_auto-theft`,type = "histogram") 
p
p <- plot_ly(x = ~rpois(n,lambda = 3), type = "histogram")
p
library(glmnet)
library(kableExtra)

lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = T, main = "Sélection Lasso des variables")

kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
OTHER_FAMILY
glm <- glm(`nb_auto-theft`~., data = Mat_crime, family = "poisson")
summary(glm)
## 
## Call:
## glm(formula = `nb_auto-theft` ~ ., family = "poisson", data = Mat_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -17.1645   -3.6706   -0.9488    3.5901   13.9177  
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    5.523e+00  1.912e-02 288.780  < 2e-16 ***
## TWO_OR_MORE                   -4.587e-05  4.348e-05  -1.055   0.2915    
## MALE                          -3.425e-04  1.642e-04  -2.087   0.0369 *  
## FEMALE                        -9.360e-04  1.616e-04  -5.792 6.97e-09 ***
## AGE_LESS_5                     8.691e-04  1.114e-04   7.803 6.04e-15 ***
## AGE_30_TO_39                   3.009e-04  2.234e-05  13.472  < 2e-16 ***
## TTL_AGE_3_PLUS_ENRSTATUS       3.186e-04  1.517e-04   2.100   0.0357 *  
## NOT_ENROLLED                   6.326e-04  4.125e-05  15.336  < 2e-16 ***
## TTLPOP_25PLUS_EDU             -3.403e-04  2.992e-05 -11.373  < 2e-16 ***
## TTLPOP_5PLUS_LNG                      NA         NA      NA       NA    
## OTHER_FAMILY                   8.465e-04  9.599e-05   8.818  < 2e-16 ***
## FEMALE_HHLDR_NO_HSBND_PRESENT  1.173e-04  1.173e-04   1.000   0.3172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5975.0  on 42  degrees of freedom
## Residual deviance: 1485.1  on 32  degrees of freedom
## AIC: 1849.8
## 
## Number of Fisher Scoring iterations: 4
library(AER)
fit <- glm(`nb_auto-theft`~., data = Mat_crime, family="poisson") 
dispersiontest(fit, trafo = 1)
## 
##  Overdispersion test
## 
## data:  fit
## z = 4.1125, p-value = 1.957e-05
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 32.63091
library(kableExtra)

fit.surdisp <- glm(`nb_auto-theft`~., data = Mat_crime,
                   family = "quasipoisson") 
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
## 
## Call:
## glm(formula = `nb_auto-theft` ~ ., family = "quasipoisson", data = Mat_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -17.1645   -3.6706   -0.9488    3.5901   13.9177  
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.523e+00  1.286e-01  42.960   <2e-16 ***
## TWO_OR_MORE                   -4.587e-05  2.923e-04  -0.157   0.8763    
## MALE                          -3.425e-04  1.103e-03  -0.310   0.7583    
## FEMALE                        -9.360e-04  1.086e-03  -0.862   0.3953    
## AGE_LESS_5                     8.691e-04  7.487e-04   1.161   0.2543    
## AGE_30_TO_39                   3.009e-04  1.502e-04   2.004   0.0536 .  
## TTL_AGE_3_PLUS_ENRSTATUS       3.186e-04  1.019e-03   0.312   0.7567    
## NOT_ENROLLED                   6.326e-04  2.773e-04   2.281   0.0293 *  
## TTLPOP_25PLUS_EDU             -3.403e-04  2.011e-04  -1.692   0.1004    
## TTLPOP_5PLUS_LNG                      NA         NA      NA       NA    
## OTHER_FAMILY                   8.465e-04  6.453e-04   1.312   0.1989    
## FEMALE_HHLDR_NO_HSBND_PRESENT  1.173e-04  7.885e-04   0.149   0.8826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 45.18634)
## 
##     Null deviance: 5975.0  on 42  degrees of freedom
## Residual deviance: 1485.1  on 32  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
kable(affichage,col.names = "Variables sélectionnées par glm Quasi-Poisson")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par glm Quasi-Poisson
(Intercept)
NOT_ENROLLED
kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
OTHER_FAMILY

Vol dans un véhicule à moteur

#Sélection de variables par corrélation avec la variable nb_theft-from-motor-vehicle

library(MASS)
library(knitr)
library(kableExtra)

  

Mat_crime <- df_survey[,M]
Mat_crime$`nb_theft-from-motor-vehicle` <- df_survey$`nb_theft-from-motor-vehicle`

reg <- lm(`nb_theft-from-motor-vehicle`~., data = Mat_crime)
summary(reg)
## 
## Call:
## lm(formula = `nb_theft-from-motor-vehicle` ~ ., data = Mat_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -343.55 -152.05   17.75  155.09  415.89 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          239.948159  85.951583   2.792  0.00890 **
## TWO_OR_MORE            0.834893   0.267186   3.125  0.00384 **
## MALE                   0.005720   0.153673   0.037  0.97055   
## AGE_20_TO_29           0.007201   0.093493   0.077  0.93910   
## AGE_30_TO_39           0.009423   0.109481   0.086  0.93197   
## NOT_ENROLLED           0.356429   0.231742   1.538  0.13418   
## COMMUTE_LESS_15        0.292148   0.207469   1.408  0.16904   
## TTLPOP_25PLUS_EDU     -0.474503   0.234439  -2.024  0.05166 . 
## ONLY_ENGLISH_LNG      -0.044782   0.050868  -0.880  0.38544   
## TTL_HOUSING_UNITS      0.139664   0.097156   1.438  0.16059   
## HH_INC_125000_149999   1.748683   0.515635   3.391  0.00191 **
## BUILT_2000_2009        0.007187   0.070519   0.102  0.91948   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 223.5 on 31 degrees of freedom
## Multiple R-squared:  0.8694, Adjusted R-squared:  0.823 
## F-statistic: 18.76 on 11 and 31 DF,  p-value: 9.885e-11
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])

kable(affichage,col.names = "  Variables sélectionnées par critère AIC")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par critère AIC
(Intercept)
TWO_OR_MORE
NOT_ENROLLED
COMMUTE_LESS_15
TTLPOP_25PLUS_EDU
TTL_HOUSING_UNITS
HH_INC_125000_149999
library(plotly)
library(knitr)
n <- sum(df_survey$`nb_theft-from-motor-vehicle`)
p <- plot_ly(x = ~df_survey$`nb_theft-from-motor-vehicle`,type = "histogram") 
p
p <- plot_ly(x = ~rpois(n,lambda = 1.6), type = "histogram")
p
library(glmnet)
library(kableExtra)

lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = T, main = "Sélection Lasso des variables")

kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
AGE_30_TO_39
HH_INC_125000_149999
glm <- glm(`nb_theft-from-motor-vehicle`~., data = Mat_crime, family = "poisson")
summary(glm)
## 
## Call:
## glm(formula = `nb_theft-from-motor-vehicle` ~ ., family = "poisson", 
##     data = Mat_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -15.3857   -4.5671    0.0845    3.3067   11.4538  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           5.912e+00  1.526e-02 387.394  < 2e-16 ***
## TWO_OR_MORE           4.333e-04  4.063e-05  10.666  < 2e-16 ***
## MALE                  1.426e-04  2.495e-05   5.717 1.08e-08 ***
## AGE_20_TO_29         -1.702e-04  1.402e-05 -12.147  < 2e-16 ***
## AGE_30_TO_39          7.640e-05  1.703e-05   4.485 7.28e-06 ***
## NOT_ENROLLED          2.920e-04  3.728e-05   7.832 4.81e-15 ***
## COMMUTE_LESS_15       4.215e-04  3.460e-05  12.183  < 2e-16 ***
## TTLPOP_25PLUS_EDU    -5.498e-04  3.885e-05 -14.151  < 2e-16 ***
## ONLY_ENGLISH_LNG     -8.761e-06  9.041e-06  -0.969    0.333    
## TTL_HOUSING_UNITS     2.438e-04  1.554e-05  15.692  < 2e-16 ***
## HH_INC_125000_149999  1.308e-03  8.574e-05  15.253  < 2e-16 ***
## BUILT_2000_2009      -4.229e-05  1.056e-05  -4.004 6.24e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 11489.5  on 42  degrees of freedom
## Residual deviance:  1775.6  on 31  degrees of freedom
## AIC: 2158.9
## 
## Number of Fisher Scoring iterations: 4
library(AER)
fit <- glm(`nb_theft-from-motor-vehicle`~., data = Mat_crime, family="poisson") 
dispersiontest(fit, trafo = 1)
## 
##  Overdispersion test
## 
## data:  fit
## z = 5.2666, p-value = 6.948e-08
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 39.30164
library(kableExtra)

fit.surdisp <- glm(`nb_theft-from-motor-vehicle`~., data = Mat_crime,
                   family = "quasipoisson") 
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
## 
## Call:
## glm(formula = `nb_theft-from-motor-vehicle` ~ ., family = "quasipoisson", 
##     data = Mat_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -15.3857   -4.5671    0.0845    3.3067   11.4538  
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.912e+00  1.141e-01  51.813   <2e-16 ***
## TWO_OR_MORE           4.333e-04  3.038e-04   1.427   0.1637    
## MALE                  1.426e-04  1.865e-04   0.765   0.4503    
## AGE_20_TO_29         -1.702e-04  1.048e-04  -1.625   0.1143    
## AGE_30_TO_39          7.640e-05  1.274e-04   0.600   0.5529    
## NOT_ENROLLED          2.920e-04  2.787e-04   1.047   0.3030    
## COMMUTE_LESS_15       4.215e-04  2.587e-04   1.629   0.1133    
## TTLPOP_25PLUS_EDU    -5.498e-04  2.905e-04  -1.893   0.0678 .  
## ONLY_ENGLISH_LNG     -8.761e-06  6.760e-05  -0.130   0.8977    
## TTL_HOUSING_UNITS     2.438e-04  1.162e-04   2.099   0.0441 *  
## HH_INC_125000_149999  1.308e-03  6.410e-04   2.040   0.0499 *  
## BUILT_2000_2009      -4.229e-05  7.897e-05  -0.535   0.5961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 55.90103)
## 
##     Null deviance: 11489.5  on 42  degrees of freedom
## Residual deviance:  1775.6  on 31  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
kable(affichage,col.names = "Variables sélectionnées par glm Quasi-Poisson")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par glm Quasi-Poisson
(Intercept)
TTL_HOUSING_UNITS
HH_INC_125000_149999
kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
AGE_30_TO_39
HH_INC_125000_149999

Larcin

#Sélection de variables par corrélation avec la variable nb_larceny

library(MASS)
library(knitr)
library(kableExtra)

  

Mat_crime <- df_survey[,M]
Mat_crime$nb_larceny <- df_survey$nb_larceny

reg <- lm(nb_larceny~., data = Mat_crime)
summary(reg)
## 
## Call:
## lm(formula = nb_larceny ~ ., data = Mat_crime)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1363.03  -281.68    45.14   353.54  1271.53 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             -250.40330  197.70486  -1.267  0.21475   
## WHITE                      0.03021    0.12830   0.235  0.81537   
## AGE_30_TO_39               0.01165    0.17209   0.068  0.94648   
## GRADUATE_SCHOOL            2.06803    1.47863   1.399  0.17185   
## BACHELORS_OR_HIGHER_EDU   -0.17379    0.40439  -0.430  0.67035   
## RENTER_OCCUPIED_HU         0.41564    0.30155   1.378  0.17796   
## NONFAMILY_HOUSEHOLD        0.08172    0.32963   0.248  0.80583   
## HH_INC_125000_149999      -0.51607    1.16210  -0.444  0.66007   
## HH_INC_150000_199999      -3.30758    1.60624  -2.059  0.04796 * 
## HH_INC_OVER_200000         3.04069    1.07471   2.829  0.00811 **
## BUILT_2014_OR_LATER        0.73084    1.27046   0.575  0.56927   
## BUILT_2010_2013            1.45262    0.87755   1.655  0.10795   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 580.6 on 31 degrees of freedom
## Multiple R-squared:  0.8464, Adjusted R-squared:  0.792 
## F-statistic: 15.54 on 11 and 31 DF,  p-value: 1.084e-09
regfinale <- stepAIC(reg, direction = "backward", trace = F)
a <- summary(regfinale)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])

kable(affichage,col.names = "Variables sélectionnées par critère AIC")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par critère AIC
RENTER_OCCUPIED_HU
HH_INC_150000_199999
HH_INC_OVER_200000
BUILT_2010_2013
library(plotly)
library(knitr)
n <- sum(df_survey$nb_larceny)
p <- plot_ly(x = ~df_survey$nb_larceny,type = "histogram") 
p
p <- plot_ly(x = ~rpois(n,lambda = 0.5), type = "histogram")
p
library(glmnet)
library(kableExtra)

lasso <- glmnet(x = as.matrix(Mat_crime[,-11]),
                y = as.matrix(Mat_crime[,11])/sd(as.matrix(Mat_crime[,11])),
                family = "poisson")
plot(lasso, xvar = "lambda", label = T, main = "Sélection Lasso des variables")

kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
HH_INC_125000_149999
BUILT_2014_OR_LATER
glm <- glm(nb_larceny~., data = Mat_crime, family = "poisson")
summary(glm)
## 
## Call:
## glm(formula = nb_larceny ~ ., family = "poisson", data = Mat_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -27.336   -5.919   -2.386    4.258   29.577  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              5.738e+00  1.389e-02 413.196  < 2e-16 ***
## WHITE                    4.156e-06  8.918e-06   0.466  0.64120    
## AGE_30_TO_39             6.950e-05  9.014e-06   7.711 1.25e-14 ***
## GRADUATE_SCHOOL          1.101e-03  9.328e-05  11.798  < 2e-16 ***
## BACHELORS_OR_HIGHER_EDU -3.947e-04  2.795e-05 -14.120  < 2e-16 ***
## RENTER_OCCUPIED_HU       3.299e-04  1.701e-05  19.395  < 2e-16 ***
## NONFAMILY_HOUSEHOLD      1.912e-04  1.873e-05  10.207  < 2e-16 ***
## HH_INC_125000_149999    -2.578e-04  6.217e-05  -4.146 3.38e-05 ***
## HH_INC_150000_199999    -2.608e-04  9.636e-05  -2.707  0.00679 ** 
## HH_INC_OVER_200000       2.290e-03  6.668e-05  34.344  < 2e-16 ***
## BUILT_2014_OR_LATER      2.459e-04  7.779e-05   3.162  0.00157 ** 
## BUILT_2010_2013          4.837e-04  4.951e-05   9.770  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 40709.2  on 42  degrees of freedom
## Residual deviance:  4141.9  on 31  degrees of freedom
## AIC: 4525.9
## 
## Number of Fisher Scoring iterations: 5
library(AER)
fit <- glm(nb_larceny~., data = Mat_crime, family="poisson") 
dispersiontest(fit, trafo = 1)
## 
##  Overdispersion test
## 
## data:  fit
## z = 2.9528, p-value = 0.001575
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 102.2864
library(kableExtra)

fit.surdisp <- glm(nb_larceny~., data = Mat_crime,
                   family = "quasipoisson") 
a <- summary(fit.surdisp)
affichage <- rownames(a$coefficients[which(a$coefficients[,4]<0.05),])
a
## 
## Call:
## glm(formula = nb_larceny ~ ., family = "quasipoisson", data = Mat_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -27.336   -5.919   -2.386    4.258   29.577  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.738e+00  1.662e-01  34.519  < 2e-16 ***
## WHITE                    4.156e-06  1.067e-04   0.039  0.96919    
## AGE_30_TO_39             6.950e-05  1.079e-04   0.644  0.52421    
## GRADUATE_SCHOOL          1.101e-03  1.117e-03   0.986  0.33197    
## BACHELORS_OR_HIGHER_EDU -3.947e-04  3.346e-04  -1.180  0.24714    
## RENTER_OCCUPIED_HU       3.299e-04  2.036e-04   1.620  0.11530    
## NONFAMILY_HOUSEHOLD      1.912e-04  2.242e-04   0.853  0.40038    
## HH_INC_125000_149999    -2.578e-04  7.442e-04  -0.346  0.73141    
## HH_INC_150000_199999    -2.608e-04  1.153e-03  -0.226  0.82257    
## HH_INC_OVER_200000       2.290e-03  7.981e-04   2.869  0.00735 ** 
## BUILT_2014_OR_LATER      2.459e-04  9.311e-04   0.264  0.79342    
## BUILT_2010_2013          4.837e-04  5.927e-04   0.816  0.42064    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 143.2854)
## 
##     Null deviance: 40709.2  on 42  degrees of freedom
## Residual deviance:  4141.9  on 31  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
kable(affichage,col.names = "Variables sélectionnées par glm Quasi-Poisson")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variables sélectionnées par glm Quasi-Poisson
(Intercept)
HH_INC_OVER_200000
kable(names(which(log(lasso$beta[,10])!="-Inf")),col.names = c("Variable sélectionnées par la régression Lasso"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))%>%
  scroll_box(width = "100%")
Variable sélectionnées par la régression Lasso
HH_INC_125000_149999
BUILT_2014_OR_LATER