Document complémentaire au module 7 du cours SDD II de 2025-2026. Distribué sous licence CC BY-NC-SA 4.0.
Veuillez vous référer au cours en ligne pour les explications et les interprétations de cette analyse.
Installer un environnement R adéquat pour reproduire cette analyse.
# Chargement du dialecte SciViews::R avec le module d'exploration des données
SciViews::R("explore", lang = "fr")
# Lecture du jeu de données depuis le package mlbench
pima <- read("PimaIndiansDiabetes2", package = "mlbench")
pima## # A data.trame: [768 × 9]
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6 148 72 35 NA 33.6 0.627 50 pos
## 2 1 85 66 29 NA 26.6 0.351 31 neg
## 3 8 183 64 NA NA 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.29 33 pos
## 6 5 116 74 NA NA 25.6 0.201 30 neg
## 7 3 78 50 32 88 31 0.248 26 pos
## 8 10 115 NA NA NA 35.3 0.134 29 neg
## 9 2 197 70 45 543 30.5 0.158 53 pos
## 10 8 125 96 NA NA NA 0.232 54 pos
## # ℹ 758 more rows
## # A data.trame: [392 × 9]
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 1 89 66 23 94 28.1 0.167 21 neg
## 2 0 137 40 35 168 43.1 2.29 33 pos
## 3 3 78 50 32 88 31 0.248 26 pos
## 4 2 197 70 45 543 30.5 0.158 53 pos
## 5 1 189 60 23 846 30.1 0.398 59 pos
## 6 5 166 72 19 175 25.8 0.587 51 pos
## 7 0 118 84 47 230 45.8 0.551 31 pos
## 8 1 103 30 38 83 43.3 0.183 33 neg
## 9 1 115 70 30 96 34.6 0.529 32 pos
## 10 3 126 88 41 235 39.3 0.704 27 neg
## # ℹ 382 more rows
| Name | pima |
| Number of rows | 392 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| diabetes | 0 | 1 | FALSE | 2 | neg: 262, pos: 130 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| pregnant | 0 | 1 | 3.30 | 3.21 | 0.00 | 1.00 | 2.00 | 5.00 | 17.00 | ▇▂▂▁▁ |
| glucose | 0 | 1 | 122.63 | 30.86 | 56.00 | 99.00 | 119.00 | 143.00 | 198.00 | ▂▇▇▃▂ |
| pressure | 0 | 1 | 70.66 | 12.50 | 24.00 | 62.00 | 70.00 | 78.00 | 110.00 | ▁▂▇▆▁ |
| triceps | 0 | 1 | 29.15 | 10.52 | 7.00 | 21.00 | 29.00 | 37.00 | 63.00 | ▅▇▇▃▁ |
| insulin | 0 | 1 | 156.06 | 118.84 | 14.00 | 76.75 | 125.50 | 190.00 | 846.00 | ▇▂▁▁▁ |
| mass | 0 | 1 | 33.09 | 7.03 | 18.20 | 28.40 | 33.20 | 37.10 | 67.10 | ▃▇▃▁▁ |
| pedigree | 0 | 1 | 0.52 | 0.35 | 0.09 | 0.27 | 0.45 | 0.69 | 2.42 | ▇▃▁▁▁ |
| age | 0 | 1 | 30.86 | 10.20 | 21.00 | 23.00 | 27.00 | 36.00 | 81.00 | ▇▂▁▁▁ |
# Calcul de la matrice de corrélation, tableau formaté et graphique
pima_cor <- correlation(pima[, 2:8])
tabularise(pima_cor, digits = 2)Matrice de coefficients de corrélation de Pearson r | |||||||
|---|---|---|---|---|---|---|---|
| glucose | pressure | triceps | insulin | mass | pedigree | age |
glucose | 1.000 | 0.2100 | 0.199 | 0.5812 | 0.2095 | 0.140 | 0.3436 |
pressure | 0.210 | 1.0000 | 0.233 | 0.0985 | 0.3044 | -0.016 | 0.3000 |
triceps | 0.199 | 0.2326 | 1.000 | 0.1822 | 0.6644 | 0.160 | 0.1678 |
insulin | 0.581 | 0.0985 | 0.182 | 1.0000 | 0.2264 | 0.136 | 0.2171 |
mass | 0.210 | 0.3044 | 0.664 | 0.2264 | 1.0000 | 0.159 | 0.0698 |
pedigree | 0.140 | -0.0160 | 0.160 | 0.1359 | 0.1588 | 1.000 | 0.0850 |
age | 0.344 | 0.3000 | 0.168 | 0.2171 | 0.0698 | 0.085 | 1.0000 |
# ACP sur pima avec standardisation des données
pima_pca <- pca(data = pima, ~ glucose + pressure + triceps + insulin + mass +
pedigree + age, scale = TRUE)
# Idem, mais avec élimination des colonnes non utilisées auparavant
pima %>.%
sselect(., glucose:age) %>.%
pca(., scale = TRUE) %->%
pima_pca
# Résumé de notre ACP
summary(pima_pca)## Importance of components (eigenvalues):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Variance 2.412 1.288 1.074 0.878 0.6389 0.399 0.3098
## Proportion of Variance 0.345 0.184 0.153 0.126 0.0913 0.057 0.0443
## Cumulative Proportion 0.345 0.529 0.682 0.807 0.8988 0.956 1.0000
##
## Loadings (eigenvectors, rotation matrix):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## glucose 0.441 0.455 -0.198 -0.736
## pressure 0.329 -0.101 -0.613 0.206 -0.654 0.171
## triceps 0.439 -0.488 0.367 0.644
## insulin 0.402 0.418 0.263 -0.388 -0.123 0.642 0.129
## mass 0.446 -0.506 -0.181 -0.711
## pedigree 0.198 0.625 0.711 -0.251
## age 0.325 0.337 -0.384 0.471 0.592 0.168 -0.179
# Graphique de l'espace des variables pour les deux premiers axes de l'ACP
chart$loadings(pima_pca, choices = c(1, 2))# Graphique de l'espace des variables pour les axes PC1 et PC3
chart$loadings(pima_pca, choices = c(1, 3))# Graphique de l'espace des variables pour les axes PC2 et PC3
chart$loadings(pima_pca, choices = c(2, 3))# Graphique de l'espace des individus pour les deux premeirs axes de l'ACP
chart$scores(pima_pca, choices = c(1, 2), aspect.ratio = 3/5)# Idem, mais avec labels et ellipses
chart$scores(pima_pca, choices = c(1, 2),
labels = pima$diabetes) +
stat_ellipse()# Graphique de l'espace des individus pour PC1 et PC3 + labels et ellipses
chart$scores(pima_pca, choices = c(1, 3),
labels = pima$diabetes) +
stat_ellipse()# Graphique de l'espace des individus pour PC2 et PC3 + labels et ellipses
chart$scores(pima_pca, choices = c(2, 3),
labels = pima$diabetes) +
stat_ellipse()# Récupération des données depuis le package data.io
urchin <- read("urchin_bio", package = "data.io", lang = "FR")
urchin## # A data.trame: [421 × 19]
## origin diameter1 diameter2 height buoyant_weight weight solid_parts integuments dry_integuments digestive_tract
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Pêcherie 9.9 10.2 5 NA 0.522 0.478 0.366 NA 0.0525
## 2 Pêcherie 10.5 10.6 5.7 NA 0.642 0.589 0.445 NA 0.0482
## 3 Pêcherie 10.8 10.8 5.2 NA 0.734 0.677 0.533 NA 0.0758
## 4 Pêcherie 9.6 9.3 4.6 NA 0.370 0.344 0.266 NA 0.0442
## 5 Pêcherie 10.4 10.7 4.8 NA 0.610 0.559 0.406 NA 0.0743
## 6 Pêcherie 10.5 11.1 5 NA 0.610 0.551 0.427 NA 0.0492
## 7 Pêcherie 11 11 5.2 NA 0.672 0.605 0.452 NA 0.0734
## 8 Pêcherie 11.1 11.2 5.7 NA 0.703 0.628 0.484 NA 0.06
## 9 Pêcherie 9.4 9.2 4.6 NA 0.413 0.375 0.277 NA 0.0345
## 10 Pêcherie 10.1 9.5 4.7 NA 0.449 0.398 0.212 NA 0.0441
## # ℹ 411 more rows
## # ℹ 9 more variables: dry_digestive_tract <dbl>, gonads <dbl>, dry_gonads <dbl>, skeleton <dbl>, lantern <dbl>, test <dbl>,
## # spines <dbl>, maturity <int>, sex <fct>
# Elimination des variables inutilisées et des données manquantes
urchin %>.%
sselect(., -(skeleton:spines), -sex) %>.%
sdrop_na(.) ->
urchin2
urchin2## # A data.trame: [319 × 14]
## origin diameter1 diameter2 height buoyant_weight weight solid_parts integuments dry_integuments digestive_tract
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Pêcherie 16.7 16.8 8.4 0.588 2.58 2.04 1.77 1.06 0.0644
## 2 Pêcherie 19.9 20 9.2 1.10 4.26 3.66 3.11 1.95 0.217
## 3 Pêcherie 19.9 19.2 8.5 0.629 2.93 2.43 2.13 1.17 0.115
## 4 Pêcherie 19.3 19.8 10.2 0.781 3.71 3.09 2.52 1.45 0.179
## 5 Pêcherie 18.8 20 9.3 0.761 3.59 2.99 2.4 1.41 0.282
## 6 Pêcherie 21.5 20.9 9.6 1.13 4.98 4.42 3.47 2.04 0.466
## 7 Pêcherie 17.4 16.5 7.8 0.477 2.33 1.97 1.57 0.89 0.0785
## 8 Pêcherie 21 21.2 10.8 1.23 5.4 4.55 3.81 2.21 0.358
## 9 Pêcherie 17.8 18.8 8.6 0.548 2.58 2.07 1.71 1.05 0.120
## 10 Pêcherie 19.7 19.6 9.7 0.862 3.59 3.08 2.51 1.59 0.154
## # ℹ 309 more rows
## # ℹ 4 more variables: dry_digestive_tract <dbl>, gonads <dbl>, dry_gonads <dbl>, maturity <int>
| Name | urchin2 |
| Number of rows | 319 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| origin | 0 | 1 | FALSE | 2 | Cul: 188, Pêc: 131 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| diameter1 | 0 | 1 | 32.78 | 11.71 | 14.60 | 23.25 | 31.50 | 39.65 | 65.60 | ▇▇▅▃▁ |
| diameter2 | 0 | 1 | 32.71 | 11.67 | 15.00 | 23.45 | 31.60 | 39.60 | 65.60 | ▇▇▆▃▁ |
| height | 0 | 1 | 16.78 | 6.25 | 7.30 | 11.10 | 16.20 | 21.50 | 32.20 | ▇▆▆▅▂ |
| buoyant_weight | 0 | 1 | 4.27 | 3.84 | 0.31 | 1.35 | 3.18 | 5.67 | 17.73 | ▇▃▁▁▁ |
| weight | 0 | 1 | 21.80 | 21.37 | 1.61 | 6.08 | 15.25 | 28.14 | 100.51 | ▇▂▁▁▁ |
| solid_parts | 0 | 1 | 16.52 | 15.27 | 1.46 | 4.96 | 11.73 | 21.69 | 73.14 | ▇▃▁▁▁ |
| integuments | 0 | 1 | 12.32 | 10.64 | 1.09 | 4.00 | 9.40 | 16.08 | 47.22 | ▇▃▂▁▁ |
| dry_integuments | 0 | 1 | 7.16 | 6.30 | 0.58 | 2.22 | 5.42 | 9.42 | 28.80 | ▇▃▂▁▁ |
| digestive_tract | 0 | 1 | 1.90 | 2.03 | 0.03 | 0.45 | 1.21 | 2.54 | 10.37 | ▇▂▁▁▁ |
| dry_digestive_tract | 0 | 1 | 0.23 | 0.21 | 0.01 | 0.07 | 0.17 | 0.31 | 1.02 | ▇▃▂▁▁ |
| gonads | 0 | 1 | 1.72 | 2.65 | 0.00 | 0.10 | 0.63 | 2.20 | 15.93 | ▇▁▁▁▁ |
| dry_gonads | 0 | 1 | 0.51 | 0.82 | 0.00 | 0.03 | 0.17 | 0.64 | 5.00 | ▇▁▁▁▁ |
| maturity | 0 | 1 | 0.37 | 0.71 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▂ |
# Matrice de corrélation, tableau formaté et graphique
urchin2_cor <- correlation(urchin2[, 2:13])
tabularise(urchin2_cor)Matrice de coefficients de corrélation de Pearson r | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| diameter1 | diameter2 | height | buoyant_weight | weight | solid_parts | integuments | dry_integuments | digestive_tract | dry_digestive_tract | gonads | dry_gonads |
diameter1 | 1.000 | 0.998 | 0.976 | 0.952 | 0.956 | 0.956 | 0.968 | 0.956 | 0.910 | 0.932 | 0.798 | 0.788 |
diameter2 | 0.998 | 1.000 | 0.974 | 0.952 | 0.957 | 0.957 | 0.969 | 0.956 | 0.912 | 0.933 | 0.798 | 0.788 |
height | 0.976 | 0.974 | 1.000 | 0.928 | 0.924 | 0.926 | 0.944 | 0.934 | 0.875 | 0.909 | 0.759 | 0.747 |
buoyant_weight | 0.952 | 0.952 | 0.928 | 1.000 | 0.986 | 0.992 | 0.994 | 0.999 | 0.921 | 0.939 | 0.882 | 0.868 |
weight | 0.956 | 0.957 | 0.924 | 0.986 | 1.000 | 0.994 | 0.991 | 0.986 | 0.952 | 0.955 | 0.882 | 0.875 |
solid_parts | 0.956 | 0.957 | 0.926 | 0.992 | 0.994 | 1.000 | 0.994 | 0.991 | 0.947 | 0.955 | 0.907 | 0.897 |
integuments | 0.968 | 0.969 | 0.944 | 0.994 | 0.991 | 0.994 | 1.000 | 0.996 | 0.934 | 0.946 | 0.867 | 0.854 |
dry_integuments | 0.956 | 0.956 | 0.934 | 0.999 | 0.986 | 0.991 | 0.996 | 1.000 | 0.920 | 0.938 | 0.874 | 0.860 |
digestive_tract | 0.910 | 0.912 | 0.875 | 0.921 | 0.952 | 0.947 | 0.934 | 0.920 | 1.000 | 0.980 | 0.806 | 0.807 |
dry_digestive_tract | 0.932 | 0.933 | 0.909 | 0.939 | 0.955 | 0.955 | 0.946 | 0.938 | 0.980 | 1.000 | 0.816 | 0.818 |
gonads | 0.798 | 0.798 | 0.759 | 0.882 | 0.882 | 0.907 | 0.867 | 0.874 | 0.806 | 0.816 | 1.000 | 0.991 |
dry_gonads | 0.788 | 0.788 | 0.747 | 0.868 | 0.875 | 0.897 | 0.854 | 0.860 | 0.807 | 0.818 | 0.991 | 1.000 |
# Graphique de la masse totle en fonction du diamètre 1
chart(data = urchin2, weight ~ diameter1) +
geom_point()# Graphique de la masse totle en fonction du diamètre 1, avec double log
chart(data = urchin2, log(weight) ~ log(diameter1)) +
geom_point()# ACP sur données transformées log(x + 1) et standardisation
urchin2 %>.%
sselect(., -origin, -maturity) %>.% # Élimine les variables non quantitatives
log1p(.) %>.% # Transforme toutes les autres en log(x + 1)
pca(data = ., ~., scale = TRUE) -> # Effectue l'ACP après standardisation
urchin2_pca
# Résumé de l'ACP
summary(urchin2_pca)## Importance of components (eigenvalues):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
## Variance 11.219 0.5010 0.1813 0.03862 0.02601 0.01657 0.00931 0.00336 0.00210 0.00108 0.00082 0.00034
## Proportion of Variance 0.935 0.0418 0.0151 0.00322 0.00217 0.00138 0.00078 0.00028 0.00017 0.00009 0.00007 0.00003
## Cumulative Proportion 0.935 0.9767 0.9918 0.99503 0.99720 0.99858 0.99936 0.99964 0.99981 0.99990 0.99997 1.00000
##
## Loadings (eigenvectors, rotation matrix):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
## diameter1 0.295 -0.162 -0.441 0.237 0.174 -0.177 0.242 -0.706
## diameter2 0.295 -0.166 -0.449 0.249 0.178 -0.157 0.124 0.688 -0.263
## height 0.291 -0.218 0.154 -0.106 -0.902
## buoyant_weight 0.296 0.120 0.509 0.124 0.530 0.265 0.504
## weight 0.296 -0.149 -0.145 0.116 0.912
## solid_parts 0.297 -0.106 0.127 -0.234 -0.594 -0.216 0.638
## integuments 0.296 -0.159 0.157 0.153 0.125 -0.396 -0.148 -0.702 0.371
## dry_integuments 0.296 -0.115 0.160 0.455 0.114 0.105 -0.128 -0.133 -0.774
## digestive_tract 0.288 -0.571 -0.187 -0.485 0.519 0.201
## dry_digestive_tract 0.283 -0.702 0.217 0.278 -0.513 -0.161
## gonads 0.271 0.568 0.226 -0.575 -0.430 0.143
## dry_gonads 0.259 0.697 -0.104 0.465 0.453 -0.111
# Graphique de l'espace des variables de l'ACP pour les deux premiers axes (par défaut)
chart$loadings(urchin2_pca)# Nouvelle ACP sur données divisées par la masse immergée
urchin2 %>.%
sselect(., -origin, -maturity, -buoyant_weight) %>.% # Élimination des variables inutiles
(. / urchin2$buoyant_weight) %>.% # Division par buoyant_weight
log1p(.) -> # Transformation log(x + 1)
urchin3
head(urchin3)## # A data.trame: [6 × 11]
## diameter1 diameter2 height weight solid_parts integuments dry_integuments digestive_tract dry_digestive_tract gonads
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.38 3.39 2.73 1.68 1.50 1.39 1.03 0.104 0.0267 0.00914
## 2 2.95 2.96 2.24 1.59 1.47 1.35 1.02 0.181 0.0437 0.0402
## 3 3.49 3.45 2.68 1.73 1.58 1.48 1.05 0.168 0.0361 0
## 4 3.25 3.27 2.64 1.75 1.60 1.44 1.05 0.206 0.0476 0.0232
## 5 3.25 3.31 2.58 1.74 1.60 1.42 1.05 0.315 0.0661 0.0289
## 6 3.00 2.97 2.25 1.69 1.59 1.41 1.03 0.346 0.0554 0.0166
## # ℹ 1 more variable: dry_gonads <dbl>
# ACP avec standardisation
urchin3_pca <- pca(data = urchin3, ~., scale = TRUE)
# Résumé de l'ACP
summary(urchin3_pca)## Importance of components (eigenvalues):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## Variance 4.687 3.353 1.273 0.9666 0.3668 0.1724 0.10547 0.04761 0.01943 0.00834 0.00068
## Proportion of Variance 0.426 0.305 0.116 0.0879 0.0333 0.0157 0.00959 0.00433 0.00177 0.00076 0.00006
## Cumulative Proportion 0.426 0.731 0.847 0.9345 0.9678 0.9835 0.99308 0.99741 0.99918 0.99994 1.00000
##
## Loadings (eigenvectors, rotation matrix):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## diameter1 -0.425 0.145 0.297 0.101 0.101 -0.400 -0.714
## diameter2 -0.425 -0.101 0.143 0.296 0.103 -0.425 0.700
## height -0.427 0.131 0.300 0.141 -0.197 0.790
## weight 0.189 -0.428 -0.216 0.497 0.672 0.145 -0.109
## solid_parts -0.495 0.254 0.117 -0.335 0.245 -0.662 0.223
## integuments -0.142 -0.463 0.152 -0.283 0.165 -0.345 -0.667 0.266
## dry_integuments -0.259 -0.242 0.173 -0.533 -0.669 0.161 0.277
## digestive_tract 0.214 -0.370 -0.448 0.102 -0.388 0.462 0.468 -0.148
## dry_digestive_tract -0.360 -0.485 0.401 -0.429 0.338 -0.382 -0.155
## gonads 0.374 0.438 0.257 -0.223 -0.725 -0.130
## dry_gonads 0.371 0.440 0.269 -0.162 0.122 0.446 0.589
# Graphique de l'espace des variables de l'ACP pour les deux premiers axes
chart$loadings(urchin3_pca, choices = c(1, 2))# Graphique de l'espace des variables de l'ACP pour les PC2 et PC3
chart$loadings(urchin3_pca, choices = c(2, 3))# Graphique de l'espace des individus pour PC1 et PC2 avec couleur et ellipses
chart$scores(urchin3_pca, choices = c(1, 2),
col = urchin2$origin, labels = urchin2$maturity, aspect.ratio = 3/5) +
theme(legend.position = "right") +
stat_ellipse()