選択の効果

NISHIO Hirokazu [Translate]
選択の効果
#セレクションバイアス
Rでみるセレクションバイアスの例(webarchve)
たとえ学歴と能力に相関があったとしても、能力の高い人だけを取り出した集団の中では学歴と能力が無相関になりえる
この文章は、学歴（偏差値）と能力の関係について、セレクションバイアス（選択バイアス）の観点から分析したものです。主な内容は以下の通りです。
母集団全体では、偏差値と能力には正の相関がある。
能力の高い人だけを選抜した集団では、偏差値と能力の相関はほとんどみられなくなる。
一方、偏差値の高い人だけを選抜した集団では、母集団と同様に偏差値と能力に正の相関がみられる。
ただし、偏差値による選抜は、独立変数（偏差値）の分散を小さくし、推定値の標準誤差を大きくする問題がある。

この話の流れを再現するJupyter Notebookを生成して
以下は、この記事の内容を再現するJupyter Notebookです。
pyimport numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

# シミュレーションのセッティング
np.random.seed(1)
N = 100000
ability_mean, ability_std = 0, 1
hensachi_mean, hensachi_std = 50, 10
corr = 0.3

# 多変量正規分布からのサンプリング
cov = [[ability_std**2, corr*ability_std*hensachi_std], 
       [corr*ability_std*hensachi_std, hensachi_std**2]]
ability, hensachi = np.random.multivariate_normal([ability_mean, hensachi_mean], cov, N).T

# データフレームの作成
df = pd.DataFrame({'ability': ability, 'hensachi': hensachi})

# 散布図の作成
plt.figure(figsize=(6, 6))
sns.scatterplot(x='hensachi', y='ability', data=df, alpha=0.5)
plt.xlabel('偏差値')
plt.ylabel('能力')
plt.show()

# 能力によるセレクション
ability_threshold = 2
df['ability_selected'] = df['ability'] >= ability_threshold

plt.figure(figsize=(6, 6))
sns.scatterplot(x='hensachi', y='ability', data=df, hue='ability_selected', alpha=0.5)
plt.xlabel('偏差値')
plt.ylabel('能力')
plt.show()

# 偏差値によるセレクション
hensachi_threshold = 70
df['hensachi_selected'] = df['hensachi'] >= hensachi_threshold

plt.figure(figsize=(6, 6))
sns.scatterplot(x='hensachi', y='ability', data=df, hue='hensachi_selected', alpha=0.5)
plt.xlabel('偏差値')
plt.ylabel('能力')
plt.show()

# 回帰分析
model_all = smf.ols('ability ~ hensachi', data=df).fit()
model_ability = smf.ols('ability ~ hensachi', data=df[df['ability_selected']]).fit()
model_hensachi = smf.ols('ability ~ hensachi', data=df[df['hensachi_selected']]).fit()

print(model_all.summary())
print(model_ability.summary()) 
print(model_hensachi.summary())


:                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ability   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.090
Method:                 Least Squares   F-statistic:                     9896.
Date:                Wed, 29 May 2024   Prob (F-statistic):               0.00
Time:                        17:04:08   Log-Likelihood:            -1.3664e+05
No. Observations:              100000   AIC:                         2.733e+05
Df Residuals:                   99998   BIC:                         2.733e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4885      0.015    -97.242      0.000      -1.519      -1.459
hensachi       0.0298      0.000     99.481      0.000       0.029       0.030
==============================================================================
Omnibus:                        2.020   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.364   Jarque-Bera (JB):                2.011
Skew:                           0.011   Prob(JB):                        0.366
Kurtosis:                       3.006   Cond. No.                         261.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

:                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ability   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     48.42
Date:                Wed, 29 May 2024   Prob (F-statistic):           4.48e-12
Time:                        17:04:08   Log-Likelihood:                -761.44
No. Observations:                2282   AIC:                             1527.
Df Residuals:                    2280   BIC:                             1538.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0831      0.042     49.665      0.000       2.001       2.165
hensachi       0.0050      0.001      6.958      0.000       0.004       0.006
==============================================================================
Omnibus:                      659.594   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1778.136
Skew:                           1.527   Prob(JB):                         0.00
Kurtosis:                       6.062   Cond. No.                         344.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

:                            OLS Regression Results                            
==============================================================================
Dep. Variable:                ability   R-squared:                       0.010
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     22.32
Date:                Wed, 29 May 2024   Prob (F-statistic):           2.44e-06
Time:                        17:04:08   Log-Likelihood:                -3223.3
No. Observations:                2301   AIC:                             6451.
Df Residuals:                    2299   BIC:                             6462.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.3937      0.441     -3.157      0.002      -2.259      -0.528
hensachi       0.0282      0.006      4.725      0.000       0.017       0.040
==============================================================================
Omnibus:                        1.860   Durbin-Watson:                   1.924
Prob(Omnibus):                  0.395   Jarque-Bera (JB):                1.804
Skew:                           0.067   Prob(JB):                        0.406
Kurtosis:                       3.030   Cond. No.                     1.59e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

提示された3つの回帰分析の結果から、以下のような解釈ができます。

全体のサンプルを使った分析 (N=100,000)
偏差値(hensachi)と能力(ability)の間には正の関係がみられる。
偏差値が1上がると、能力は約0.03ポイント高くなる。
モデルの説明力 (R-squared) は0.090で、それほど高くない。

能力が高い (ability≧2) サンプルのみを使った分析 (N=2,282)
偏差値と能力の関係は全体サンプルに比べて弱くなっている。
偏差値が1上がると、能力は約0.005ポイントしか高くならない。
モデルの説明力は0.021とかなり低い。

偏差値が高い (hensachi≧70) サンプルのみを使った分析 (N=2,301)
偏差値と能力の関係は、全体サンプルとほぼ同じ程度である。
偏差値が1上がると、能力は約0.028ポイント高くなる。
ただし、モデルの説明力は0.010と非常に低い。

これらの結果から、以下のようなことが示唆されます。
能力の高い人々だけを集めたサンプルでは、偏差値と能力の関係がほとんどみられなくなる（セレクションバイアスの影響）。
一方、偏差値の高い人々だけを集めたサンプルでは、全体サンプルと同様の関係が維持される。
ただし、偏差値によるセレクションは、サンプルサイズを小さくし、偏差値のばらつきを小さくするため、モデルの説明力を低下させる。

つまり、特定の属性によってセレクションされたサンプルを使った分析では、母集団全体の関係性を正しく捉えられない可能性があります。特に、分析の対象となる変数（ここでは能力）によるセレクションが行われている場合、その影響は大きいといえます。

#統計 #意外な現象
"Engineer's way of creating knowledge" the English version of my book is now available on [Engineer's way of creating knowledge]
(C)NISHIO Hirokazu / Converted from [Scrapbox] at 2/25/2026, 6:30:34 PM[Edit]
Related Pages