[PlayGround S4E1] Data Analysis (EDA)

indongspace 2024. 1. 17. 11:23

https://www.kaggle.com/code/akhiljethwa/playground-s4e1-eda-modeling-xgboost

[PlayGround S4E1] 📊 EDA + 🤖 Modeling [XGBoost]

Source of the practice data

[Kaggle] Comparative Analysis of the Characteristics of Churned Customers in Their 30s and 40s (Playground S4E1) (juyoungeeya.tistory.com)

A teammate's blog post that helped me study

[Multicampus Data Analysis & Engineer course, cohort 34: data analysis practice]

# Project overview

Topic: Comparative analysis of the characteristics of churned customers in their 30s and 40s
Period: 2024.01.15 ~ 2024.01.17

# Libraries used

NumPy
pandas
Matplotlib
seaborn
SciPy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp

# Data description

Data generated by a deep learning model trained on the Bank Customer Churn Prediction dataset
- https://www.kaggle.com/competitions/playground-series-s4e1/data
Bank Customer Churn Prediction dataset
- https://www.kaggle.com/datasets/shantanudhakadd/bank-customer-churn-prediction

# Data preparation

train_data = pd.read_csv('/kaggle/input/playground-series-s4e1/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e1/test.csv')
sample_submission = pd.read_csv('/kaggle/input/playground-series-s4e1/sample_submission.csv')

original_data = pd.read_csv('/kaggle/input/bank-customer-churn-prediction/Churn_Modelling.csv')

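# Variable descriptions

CustomerId: unique ID of each customer
Surname: the customer's surname
CreditScore: the customer's credit score
Geography: the customer's country of residence
Gender: the customer's gender
Age: the customer's age
Tenure: number of years the customer has used the bank's services
Balance: the customer's account balance
NumOfProducts: number of bank products the customer uses
HasCrCard: whether the customer holds a credit card
IsActiveMember: whether the customer is an active member
EstimatedSalary: the customer's estimated salary
Exited: whether the customer churned

# Project background

During data exploration, the multivariate analysis pointed to an association between age and churn: churned customers were concentrated in particular age ranges.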

# Select the continuous variables and the target from the training data
df3 = train_data[['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Exited']].copy()

# Mapping dictionary for label changes
exit_label_mapping = {0: 'Not Exited', 1: 'Exited'}

# Change the values in the 'Exited' column using the mapping dictionary
df3['Exited'] = df3['Exited'].map(exit_label_mapping)

# Create a pairplot
sns.pairplot(df3, hue="Exited", corner=True)

# Show the plot
plt.show()

# Grouping by age ranges

# Define the bins and corresponding labels for age groups
bins = [9, 19, 29, 39, 49, 59, 69, 79, 89, 99]
age_labels = ['10s', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']

# Create a new column 'Ages' in the 'train_data' DataFrame based on the specified bins and labels
train_data['Ages'] = pd.cut(train_data['Age'], bins, labels=age_labels)

# Create new datasets for each age group
group_10s = train_data.loc[train_data['Ages'] == '10s', :]
group_20s = train_data.loc[train_data['Ages'] == '20s', :]
group_30s = train_data.loc[train_data['Ages'] == '30s', :]
group_40s = train_data.loc[train_data['Ages'] == '40s', :]
group_50s = train_data.loc[train_data['Ages'] == '50s', :]
group_60s = train_data.loc[train_data['Ages'] == '60s', :]
group_70s = train_data.loc[train_data['Ages'] == '70s', :]
group_80s = train_data.loc[train_data['Ages'] == '80s', :]
group_90s = train_data.loc[train_data['Ages'] == '90s', :]

# Create a 3x3 subplot grid
fig, ax = plt.subplots(3, 3, figsize=(18, 15))

# Define age groups and corresponding DataFrames
age_groups = {'10s': group_10s, '20s': group_20s, '30s': group_30s, '40s': group_40s,
              '50s': group_50s, '60s': group_60s, '70s': group_70s, '80s': group_80s, '90s': group_90s}

# Iterate over rows and columns of the subplot grid
for i, (age, group) in enumerate(age_groups.items()):
    row = i // 3
    col = i % 3
    
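    # Plot pie chart for each age group; reindex fixes the wedge order to [0, 1]
    # so that it always matches the legend labels below, even when churned
    # customers are the majority in a group
    counts = group['Exited'].value_counts().reindex([0, 1], fill_value=0)
    pie_chart = counts.plot.pie(autopct='%1.1f%%', ax=ax[row, col], shadow=True)
    ax[row, col].set_title(age)
    pie_chart.set_ylabel('')
    
    # Add legend for 'Not Exited' and 'Exited' at the top-left corner
    labels = ['Not Exited', 'Exited']
    ax[row, col].legend(labels, loc='upper left')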
    
# Set the overall title for the entire figure
plt.suptitle('Exited_Ages')
plt.show()

Checking the share of churned customers by age group: the share rises sharply in the 40s.
- ==> How do churned customers in their 30s differ from those in their 40s?
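
The share of churned customers per age group can also be read off as numbers instead of from the pie charts. Below is a minimal sketch, assuming a small made-up DataFrame `toy` in place of `train_data` (the values are illustrative and not taken from the competition data). It reuses the same bins and labels as above.

```python
import pandas as pd

# Hypothetical toy data standing in for train_data (values are made up)
toy = pd.DataFrame({
    'Age':    [25, 28, 33, 35, 38, 41, 44, 47, 52, 58],
    'Exited': [0,  0,  0,  1,  0,  1,  1,  0,  1,  1],
})

# Same bins and labels as in the analysis above
bins = [9, 19, 29, 39, 49, 59, 69, 79, 89, 99]
age_labels = ['10s', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']
toy['Ages'] = pd.cut(toy['Age'], bins, labels=age_labels)

# The mean of a 0/1 column is the churn rate; observed=True skips empty age groups
churn_rate = toy.groupby('Ages', observed=True)['Exited'].mean()
print(churn_rate)
```

Running the same `groupby` on `train_data` should give one churn rate per age group, which makes the jump between the 30s and 40s easier to quantify.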

# Comparing the characteristics of churned customers in their 30s and 40s

# Create datasets for customers who exited in their 30s and 40s
exited_30s = group_30s.loc[group_30s['Exited'] == 1, :]
exited_40s = group_40s.loc[group_40s['Exited'] == 1, :]

# Concatenate the data for exited customers in their 30s and 40s
exited_3040s = pd.concat([exited_30s, exited_40s])

Categorical variables: Geography, Gender, Tenure, NumOfProducts, HasCrCard, IsActiveMember
- ==> Use the chi-square test to check whether churned customers in their 30s and 40s show different tendencies

# Chi-square test for independence for Geography between exited customers in their 30s and 40s

# Importing necessary libraries
from scipy.stats import chi2_contingency

# Creating a contingency table (cross-tab) for Ages and Geography
geography_cross = pd.crosstab(exited_3040s['Ages'], exited_3040s['Geography'])

# Performing the Chi-square test for independence
chi2_geography, p_geography, dof, ef = chi2_contingency(geography_cross, correction=False)

# Displaying the test statistics
print(f'Chi-square = {chi2_geography:.3f}')
print(f'p = {p_geography:.3f}')

# Interpreting the result based on the p-value
alpha = 0.05
if p_geography < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
    
# Chi-square = 21.585
# p = 0.000
# Reject the null hypothesis.

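Is there a difference in Geography between churned customers in their 30s and 40s?
- Null hypothesis: Ages (30s vs. 40s) and Geography are independent.
- Alternative hypothesis: Ages (30s vs. 40s) and Geography are not independent.
- Test result: Chi-square = 21.585, p = 0.000 (i.e., p < 0.001)
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: the Geography distribution differs between the 30s and 40s.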
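
To make the test statistic less of a black box, here is a small sketch of how the chi-square value comes from the expected counts under independence. The 2x3 table `observed` is made up for illustration (it is not the real Ages x Geography table); the manual result is compared with `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table (rows: 30s, 40s; columns: France, Germany, Spain)
# The counts are made up for illustration only
observed = np.array([[120,  90, 60],
                     [150, 160, 70]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# The same quantities from SciPy (no continuity correction, as in the post)
chi2, p, dof, ef = chi2_contingency(observed, correction=False)

print(f'manual chi-square = {chi2_manual:.3f}, scipy chi-square = {chi2:.3f}, dof = {dof}')
```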

# Chi-square test for independence for Gender between exited customers in their 30s and 40s

# Creating a contingency table (cross-tab) for Ages and Gender
gender_cross = pd.crosstab(exited_3040s['Ages'], exited_3040s['Gender'])

# Performing the Chi-square test for independence
chi2_gender, p_gender, dof, ef = chi2_contingency(gender_cross, correction=False)

# Displaying the test statistics
print(f'Chi-square = {chi2_gender:.3f}')
print(f'p = {p_gender:.3f}')

# Interpreting the result based on the p-value
alpha = 0.05
if p_gender < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Chi-square = 14.056
# p = 0.000
# Reject the null hypothesis.

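Is there a difference in Gender between churned customers in their 30s and 40s?
- Null hypothesis: Ages (30s vs. 40s) and Gender are independent.
- Alternative hypothesis: Ages (30s vs. 40s) and Gender are not independent.
- Test result: Chi-square = 14.056, p = 0.000 (i.e., p < 0.001)
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: the Gender distribution differs between the 30s and 40s.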

# Chi-square test for independence for Tenure between exited customers in their 30s and 40s

# Creating a contingency table (cross-tab) for Ages and Tenure
tenure_cross = pd.crosstab(exited_3040s['Ages'], exited_3040s['Tenure'])

# Performing the Chi-square test for independence
chi2_tenure, p_tenure, dof, ef = chi2_contingency(tenure_cross, correction=False)

# Displaying the test statistics
print(f'Chi-square = {chi2_tenure:.3f}')
print(f'p = {p_tenure:.3f}')

# Interpreting the result based on the p-value
alpha = 0.05
if p_tenure < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Chi-square = 18.167
# p = 0.052
# Fail to reject the null hypothesis.

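Is there a difference in Tenure between churned customers in their 30s and 40s?
- Null hypothesis: Ages (30s vs. 40s) and Tenure are independent.
- Alternative hypothesis: Ages (30s vs. 40s) and Tenure are not independent.
- Test result: Chi-square = 18.167, p = 0.052
  - ==> Since p is above the 0.05 significance level, we fail to reject the null hypothesis: no statistically significant difference in Tenure was found between the 30s and 40s.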

# Chi-square test for independence for NumOfProducts between exited customers in their 30s and 40s

# Creating a contingency table (cross-tab) for Ages and NumOfProducts
products_cross = pd.crosstab(exited_3040s['Ages'], exited_3040s['NumOfProducts'])

# Performing the Chi-square test for independence
chi2_products, p_products, dof, ef = chi2_contingency(products_cross, correction=False)

# Displaying the test statistics
print(f'Chi-square = {chi2_products:.3f}')
print(f'p = {p_products:.3f}')

# Interpreting the result based on the p-value
alpha = 0.05
if p_products < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Chi-square = 44.150
# p = 0.000
# Reject the null hypothesis.

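Is there a difference in NumOfProducts between churned customers in their 30s and 40s?
- Null hypothesis: Ages (30s vs. 40s) and NumOfProducts are independent.
- Alternative hypothesis: Ages (30s vs. 40s) and NumOfProducts are not independent.
- Test result: Chi-square = 44.150, p = 0.000 (i.e., p < 0.001)
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: the NumOfProducts distribution differs between the 30s and 40s.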

# Chi-square test for independence for HasCrCard between exited customers in their 30s and 40s

# Creating a contingency table (cross-tab) for Ages and HasCrCard
crcard_cross = pd.crosstab(exited_3040s['Ages'], exited_3040s['HasCrCard'])

# Performing the Chi-square test for independence
chi2_crcard, p_crcard, dof, ef = chi2_contingency(crcard_cross, correction=False)

# Displaying the test statistics
print(f'Chi-square = {chi2_crcard:.3f}')
print(f'p = {p_crcard:.3f}')

# Interpreting the result based on the p-value
alpha = 0.05
if p_crcard < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Chi-square = 21.744
# p = 0.000
# Reject the null hypothesis.

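Is there a difference in HasCrCard between churned customers in their 30s and 40s?
- Null hypothesis: Ages (30s vs. 40s) and HasCrCard are independent.
- Alternative hypothesis: Ages (30s vs. 40s) and HasCrCard are not independent.
- Test result: Chi-square = 21.744, p = 0.000 (i.e., p < 0.001)
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: the HasCrCard distribution differs between the 30s and 40s.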

# Chi-square test for independence for IsActiveMember between exited customers in their 30s and 40s

# Creating a contingency table (cross-tab) for Ages and IsActiveMember
active_cross = pd.crosstab(exited_3040s['Ages'], exited_3040s['IsActiveMember'])

# Performing the Chi-square test for independence
chi2_active, p_active, dof, ef = chi2_contingency(active_cross, correction=False)

# Displaying the test statistics
print(f'Chi-square = {chi2_active:.3f}')
print(f'p = {p_active:.3f}')

# Interpreting the result based on the p-value
alpha = 0.05
if p_active < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Chi-square = 40.818
# p = 0.000
# Reject the null hypothesis.

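Is there a difference in IsActiveMember between churned customers in their 30s and 40s?
- Null hypothesis: Ages (30s vs. 40s) and IsActiveMember are independent.
- Alternative hypothesis: Ages (30s vs. 40s) and IsActiveMember are not independent.
- Test result: Chi-square = 40.818, p = 0.000 (i.e., p < 0.001)
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: the IsActiveMember distribution differs between the 30s and 40s.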
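
The six chi-square blocks above differ only in the column name, so they could be folded into one loop. The sketch below is one possible way to do that, not the way the original notebook was written. The helper name `chi2_by_group` and the random `toy` data are hypothetical; with the real data the call would use `exited_3040s` and the six categorical columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_by_group(df, group_col, cat_cols, alpha=0.05):
    """Run a chi-square test of independence between group_col and each column in cat_cols."""
    rows = []
    for col in cat_cols:
        table = pd.crosstab(df[group_col], df[col])
        chi2, p, dof, _ = chi2_contingency(table, correction=False)
        rows.append({'variable': col, 'chi2': chi2, 'dof': dof, 'p': p, 'reject_h0': p < alpha})
    return pd.DataFrame(rows)

# Hypothetical random data standing in for exited_3040s
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'Ages': rng.choice(['30s', '40s'], size=n),
    'Gender': rng.choice(['Female', 'Male'], size=n),
    'HasCrCard': rng.integers(0, 2, size=n),
})

results = chi2_by_group(toy, 'Ages', ['Gender', 'HasCrCard'])
print(results)
```

Collecting the results in one DataFrame also makes it easier to see all test statistics and p-values side by side.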

Continuous variables: CreditScore, Balance, EstimatedSalary
- ==> Use the independent samples t-test to check whether churned customers in their 30s and 40s show different tendencies

# Testing homogeneity of variances for CreditScore between exited customers in their 30s and 40s

# Importing the Bartlett test
from scipy.stats import bartlett

# Performing Bartlett test for homogeneity of variances
statistic, p_value = bartlett(exited_30s['CreditScore'], exited_40s['CreditScore'])

# Setting the significance level (alpha)
alpha = 0.05

# Displaying the test statistics
print(f"p = {p_value:.3f}")

# Interpreting the result based on p-value
if p_value > alpha:
    print("Homogeneity of variances can be assumed.")
else:
    print("Homogeneity of variances cannot be assumed.")

# p = 0.000
# Homogeneity of variances cannot be assumed.

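Is there a difference in CreditScore between churned customers in their 30s and 40s?

Test for homogeneity of variances
- Null hypothesis: Homogeneity of variances can be assumed.
- Alternative hypothesis: Homogeneity of variances cannot be assumed.
- Test result: p = 0.000 (i.e., p < 0.001)
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: homogeneity of variances cannot be assumed.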

# Independent samples t-test for CreditScore between exited customers in their 30s and 40s

# Importing the necessary library
from scipy import stats

# Performing independent samples t-test
t_crscore, p_crscore = stats.ttest_ind(exited_30s['CreditScore'], exited_40s['CreditScore'], equal_var=False)

# Displaying the test statistics
print(f"t = {t_crscore:.3f}")
print(f"p = {p_crscore:.3f}")

# Interpreting the result based on the p-value
if p_crscore < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# t = 0.228
# p = 0.820
# Fail to reject the null hypothesis.

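Independent samples t-test (Welch's version, since equal variances cannot be assumed)
- Null hypothesis: The mean CreditScore is the same for the 30s and 40s.
- Alternative hypothesis: The mean CreditScore is not the same for the 30s and 40s.
- Test result: t = 0.228, p = 0.820
  - ==> Since p is above the 0.05 significance level, we fail to reject the null hypothesis: no statistically significant difference in mean CreditScore was found between the 30s and 40s.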

# Testing homogeneity of variances for Balance between exited customers in their 30s and 40s

# Performing Bartlett test for homogeneity of variances
statistic, p_value = bartlett(exited_30s['Balance'], exited_40s['Balance'])

# Setting the significance level (alpha)
alpha = 0.05

# Displaying the test statistics
print(f"p = {p_value:.3f}")

# Interpreting the result based on p-value
if p_value > alpha:
    print("Homogeneity of variances can be assumed.")
else:
    print("Homogeneity of variances cannot be assumed.")

# p = 0.822
# Homogeneity of variances can be assumed.

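Is there a difference in Balance between churned customers in their 30s and 40s?

Test for homogeneity of variances
- Null hypothesis: Homogeneity of variances can be assumed.
- Alternative hypothesis: Homogeneity of variances cannot be assumed.
- Test result: p = 0.822
  - ==> Since p is above the 0.05 significance level, we fail to reject the null hypothesis: homogeneity of variances can be assumed.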

# Independent samples t-test for Balance between exited customers in their 30s and 40s

# Performing independent samples t-test
t_balance, p_balance = stats.ttest_ind(exited_30s['Balance'], exited_40s['Balance'], equal_var=True)

# Displaying the test statistics
print(f"t = {t_balance:.3f}")
print(f"p = {p_balance:.3f}")

# Interpreting the result based on the p-value
if p_balance < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# t = 2.291
# p = 0.022
# Reject the null hypothesis.

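Independent samples t-test
- Null hypothesis: The mean Balance is the same for the 30s and 40s.
- Alternative hypothesis: The mean Balance is not the same for the 30s and 40s.
- Test result: t = 2.291, p = 0.022
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: the mean Balance differs between the 30s and 40s.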

# Testing homogeneity of variances for EstimatedSalary between exited customers in their 30s and 40s

# Performing Bartlett test for homogeneity of variances
statistic, p_value = bartlett(exited_30s['EstimatedSalary'], exited_40s['EstimatedSalary'])

# Setting the significance level (alpha)
alpha = 0.05

# Displaying the test statistics
print(f"p = {p_value:.3f}")

# Interpreting the result based on p-value
if p_value > alpha:
    print("Homogeneity of variances can be assumed.")
else:
    print("Homogeneity of variances cannot be assumed.")

# p = 0.001
# Homogeneity of variances cannot be assumed.

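Is there a difference in EstimatedSalary between churned customers in their 30s and 40s?

Test for homogeneity of variances
- Null hypothesis: Homogeneity of variances can be assumed.
- Alternative hypothesis: Homogeneity of variances cannot be assumed.
- Test result: p = 0.001
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: homogeneity of variances cannot be assumed.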

# Independent samples t-test for EstimatedSalary between exited customers in their 30s and 40s

# Performing independent samples t-test
t_salary, p_salary = stats.ttest_ind(exited_30s['EstimatedSalary'], exited_40s['EstimatedSalary'], equal_var=False)

# Displaying the test statistics
print(f"t = {t_salary:.3f}")
print(f"p = {p_salary:.3f}")

# Interpreting the result based on the p-value
if p_salary < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# t = -3.966
# p = 0.000
# Reject the null hypothesis.

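Independent samples t-test (Welch's version, since equal variances cannot be assumed)
- Null hypothesis: The mean EstimatedSalary is the same for the 30s and 40s.
- Alternative hypothesis: The mean EstimatedSalary is not the same for the 30s and 40s.
- Test result: t = -3.966, p = 0.000 (i.e., p < 0.001)
  - ==> Since p is below the 0.05 significance level, the null hypothesis is rejected: the mean EstimatedSalary differs between the 30s and 40s.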
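
The two-step procedure used for each continuous variable (Bartlett's test first, then a t-test with `equal_var` set accordingly) can be wrapped in one function. This is only a sketch under assumptions: the helper name `compare_means` and the random samples are hypothetical, and with the real data the inputs would be columns such as `exited_30s['Balance']` and `exited_40s['Balance']`.

```python
import numpy as np
from scipy import stats

def compare_means(a, b, alpha=0.05):
    """Bartlett's test first, then Student's or Welch's t-test depending on its result."""
    _, p_var = stats.bartlett(a, b)
    # Equal variances are assumed only if Bartlett's test does not reject
    equal_var = bool(p_var > alpha)
    t, p = stats.ttest_ind(a, b, equal_var=equal_var)
    return {'p_bartlett': p_var, 'equal_var': equal_var, 't': t, 'p': p, 'reject_h0': bool(p < alpha)}

# Hypothetical samples with clearly different spreads (made up for illustration)
rng = np.random.default_rng(0)
sample_30s = rng.normal(loc=100, scale=10, size=300)
sample_40s = rng.normal(loc=105, scale=30, size=300)

result = compare_means(sample_30s, sample_40s)
print(result)
```

Because the two made-up samples have very different standard deviations, Bartlett's test rejects equal variances here and the function falls back to Welch's t-test (`equal_var=False`), mirroring the CreditScore and EstimatedSalary cases above.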

# Visualizing the variables that showed significant differences

Variables with a statistically significant difference between the 30s and 40s: Geography, Gender, NumOfProducts, HasCrCard, IsActiveMember, Balance, EstimatedSalary

# Compare Geography for customers who exited in their 30s and 40s

# Importing the necessary library
import matplotlib.pyplot as plt

# Create a 1x2 subplot grid
fig, ax = plt.subplots(1, 2, figsize=(18, 5.5))

# Iterate over age groups
for i, (age_group, exited_data) in enumerate([('30s', exited_30s), ('40s', exited_40s)]):
    # Plot pie chart for Geography in each age group; sort_index fixes the wedge
    # order alphabetically so that it matches the legend labels below
    exited_data['Geography'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', ax=ax[i], shadow=True)
    ax[i].set_title(f'{age_group}')
    ax[i].set_ylabel('')
    
    # Add legend at the top-left corner
    labels = ['France', 'Germany', 'Spain']
    ax[i].legend(labels, loc='upper left')

# Add text below the center of the figure
fig.text(0.5, 0.05, f'Chi-square = {chi2_geography:.3f}, p = {p_geography:.3f}', ha='center', va='center')

# Set the overall title for the entire figure
plt.suptitle('Geography Comparison for Exiting Customers')
plt.show()

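Geography
- Residents of France and Spain make up a larger share of churned customers in their 30s than in their 40s.
- Residents of Germany make up a larger share of churned customers in their 40s than in their 30s.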
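
The shares compared in the pie charts can also be printed as a table with a row-normalized crosstab. A minimal sketch follows, using a made-up `toy` DataFrame in place of `exited_3040s` (the rows are illustrative only). The same call works for Gender, NumOfProducts, HasCrCard, and IsActiveMember by swapping the column name.

```python
import pandas as pd

# Hypothetical toy data standing in for exited_3040s (values are made up)
toy = pd.DataFrame({
    'Ages':      ['30s'] * 5 + ['40s'] * 5,
    'Geography': ['France', 'France', 'France', 'Germany', 'Spain',
                  'France', 'Germany', 'Germany', 'Germany', 'Spain'],
})

# normalize='index' turns each row into proportions that sum to 1
shares = pd.crosstab(toy['Ages'], toy['Geography'], normalize='index')
print(shares.round(2))
```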

# Compare Gender for customers who exited in their 30s and 40s

# Importing the necessary library
import matplotlib.pyplot as plt

# Create a 1x2 subplot grid
fig, ax = plt.subplots(1, 2, figsize=(18, 5.5))

# Iterate over age groups
for i, (age_group, exited_data) in enumerate([('30s', exited_30s), ('40s', exited_40s)]):
    # Plot pie chart for Gender in each age group; sort_index fixes the wedge
    # order alphabetically so that it matches the legend labels below
    exited_data['Gender'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', ax=ax[i], shadow=True)
    ax[i].set_title(f'{age_group}')
    ax[i].set_ylabel('')
    
    # Add legend at the top-left corner
    labels = ['Female', 'Male']
    ax[i].legend(labels, loc='upper left')

# Add text below the center of the figure
fig.text(0.5, 0.05, f'Chi-square = {chi2_gender:.3f}, p = {p_gender:.3f}', ha='center', va='center')    
    
# Set the overall title for the entire figure
plt.suptitle('Gender Comparison for Exiting Customers')
plt.show()

Gender
- Men make up a larger share of churned customers in their 30s than in their 40s.
- Women make up a larger share of churned customers in their 40s than in their 30s.

# Compare NumOfProducts for customers who exited in their 30s and 40s

# Create a 1x2 subplot grid
fig, ax = plt.subplots(1, 2, figsize=(18, 5.5))

# Iterate over age groups
for i, (age_group, exited_data) in enumerate([('30s', exited_30s), ('40s', exited_40s)]):
    # Plot pie chart for NumOfProducts in each age group; sort_index fixes the
    # wedge order to 1, 2, 3, 4 so that it matches the legend labels below
    exited_data['NumOfProducts'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', ax=ax[i], shadow=True)
    ax[i].set_title(f'{age_group}')
    ax[i].set_ylabel('')
    
    # Add legend at the top-left corner
    labels = ['1', '2', '3', '4']
    ax[i].legend(labels, loc='upper left')

# Add text below the center of the figure
fig.text(0.5, 0.05, f'Chi-square = {chi2_products:.3f}, p = {p_products:.3f}', ha='center', va='center')    

# Set the overall title for the entire figure
plt.suptitle('NumOfProducts Comparison for Exiting Customers')
plt.show()

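NumOfProducts
- Customers using 3 bank products make up a larger share of churned customers in their 30s than in their 40s.
- Customers using 1, 2, or 4 bank products make up a larger share of churned customers in their 40s than in their 30s.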

# Compare HasCrCard for customers who exited in their 30s and 40s

# Create a 1x2 subplot grid
fig, ax = plt.subplots(1, 2, figsize=(18, 5.5))

# Iterate over age groups
for i, (age_group, exited_data) in enumerate([('30s', exited_30s), ('40s', exited_40s)]):
    # Plot pie chart for HasCrCard in each age group; sort_index fixes the wedge
    # order to [0, 1] so that it matches the legend labels below
    exited_data['HasCrCard'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', ax=ax[i], shadow=True)
    ax[i].set_title(f'{age_group}')
    ax[i].set_ylabel('')
    
    # Add legend at the top-left corner (0 = no credit card, 1 = has credit card)
    labels = ['No Credit Card', 'Has Credit Card']
    ax[i].legend(labels, loc='upper left')

# Add text below the center of the figure
fig.text(0.5, 0.05, f'Chi-square = {chi2_crcard:.3f}, p = {p_crcard:.3f}', ha='center', va='center')    
    
# Set the overall title for the entire figure
plt.suptitle('HasCrCard Comparison for Exiting Customers')
plt.show()

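HasCrCard
- Customers without a credit card make up a larger share of churned customers in their 30s than in their 40s.
- Customers with a credit card make up a larger share of churned customers in their 40s than in their 30s.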

# Compare IsActiveMember for customers who exited in their 30s and 40s

# Create a 1x2 subplot grid
fig, ax = plt.subplots(1, 2, figsize=(18, 5.5))

# Iterate over age groups
for i, (age_group, exited_data) in enumerate([('30s', exited_30s), ('40s', exited_40s)]):
    # Plot pie chart for IsActiveMember in each age group; sort_index fixes the
    # wedge order to [0, 1] so that it matches the legend labels below
    exited_data['IsActiveMember'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', ax=ax[i], shadow=True)
    ax[i].set_title(f'{age_group}')
    ax[i].set_ylabel('')
    
    # Add legend at the top-left corner (0 = not active, 1 = active)
    labels = ['Not Active Member', 'Active Member']
    ax[i].legend(labels, loc='upper left')

# Add text below the center of the figure
fig.text(0.5, 0.05, f'Chi-square = {chi2_active:.3f}, p = {p_active:.3f}', ha='center', va='center')    

# Set the overall title for the entire figure
plt.suptitle('IsActiveMember Comparison for Exiting Customers')
plt.show()

IsActiveMember
- Inactive members make up a larger share of churned customers in their 30s than in their 40s.
- Active members make up a larger share of churned customers in their 40s than in their 30s.

# Compare Balance for customers who exited in their 30s and 40s

# Importing the necessary library
import matplotlib.pyplot as plt

# Calculate mean values for Balance
balance_mean_30s = round(exited_30s['Balance'].mean(), 2)
balance_mean_40s = round(exited_40s['Balance'].mean(), 2)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot histograms for Balance in their 30s and 40s
ax.hist(exited_30s['Balance'], bins=30, alpha=0.5, label='30s')
ax.hist(exited_40s['Balance'], bins=30, alpha=0.5, label='40s')

# Add dashed lines for mean values
ax.axvline(balance_mean_30s, linestyle='dashed', linewidth=2, label=f'Mean (30s): {balance_mean_30s}')
ax.axvline(balance_mean_40s, color='red', linestyle='dashed', linewidth=2, label=f'Mean (40s): {balance_mean_40s}')

# Set title and labels
ax.set_title('Balance Comparison for Exiting Customers')
ax.set_xlabel('Balance')
ax.set_ylabel('')
ax.legend()

# Add text below the center of the figure
ax.text(0.5, -0.15, f't = {t_balance:.3f}, p = {p_balance:.3f}', ha='center', va='center', transform=ax.transAxes)

# Show the plot
plt.show()

Balance
- The mean account balance of churned customers in their 30s is higher than that of those in their 40s.

# Compare EstimatedSalary for customers who exited in their 30s and 40s

# Importing the necessary library
import matplotlib.pyplot as plt

# Calculate mean values for EstimatedSalary
salary_mean_30s = round(exited_30s['EstimatedSalary'].mean(), 2)
salary_mean_40s = round(exited_40s['EstimatedSalary'].mean(), 2)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot histograms for EstimatedSalary in their 30s and 40s
ax.hist(exited_30s['EstimatedSalary'], bins=30, alpha=0.5, label='30s')
ax.hist(exited_40s['EstimatedSalary'], bins=30, alpha=0.5, label='40s')

# Add dashed lines for mean values
ax.axvline(salary_mean_30s, linestyle='dashed', linewidth=2, label=f'Mean (30s): {salary_mean_30s}')
ax.axvline(salary_mean_40s, color='red', linestyle='dashed', linewidth=2, label=f'Mean (40s): {salary_mean_40s}')

# Set title and labels
ax.set_title('EstimatedSalary Comparison for Exiting Customers')
ax.set_xlabel('EstimatedSalary')
ax.set_ylabel('')
ax.legend()

# Add text below the center of the figure
ax.text(0.5, -0.15, f't = {t_salary:.3f}, p = {p_salary:.3f}', ha='center', va='center', transform=ax.transAxes)

# Show the plot
plt.show()

EstimatedSalary
- The mean estimated salary of churned customers in their 40s is higher than that of those in their 30s.
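
The direction of each difference can also be read from the sign of the t statistic: `ttest_ind(a, b)` returns a t value with the sign of mean(a) minus mean(b), which is why Balance (30s higher) gave a positive t and EstimatedSalary (40s higher) gave a negative one. A tiny sketch with made-up arrays `group_a` and `group_b`:

```python
import numpy as np
from scipy import stats

# Hypothetical samples: the second group has the larger mean (values are made up)
group_a = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
group_b = np.array([15.0, 17.0, 16.0, 18.0, 14.0])

t_ab, _ = stats.ttest_ind(group_a, group_b, equal_var=False)
t_ba, _ = stats.ttest_ind(group_b, group_a, equal_var=False)

# t has the sign of mean(first) - mean(second), so swapping the arguments flips the sign
print(f't(a, b) = {t_ab:.3f}, t(b, a) = {t_ba:.3f}')
```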

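# Conclusion

The churn rate rose sharply from the 30s to the 40s.
Among churned customers, the 30s and 40s differed in Geography, Gender, NumOfProducts, HasCrCard, IsActiveMember, Balance, and EstimatedSalary.
When designing a marketing strategy that targets customers in their 40s, where churn rises, the following characteristics (more common among churned customers in their 40s than in their 30s) are worth considering:
- Residents of Germany
- Women
- Customers using 1, 2, or 4 bank products
- Credit card holders
- Active members
- Lower account balance
- Higher estimated salary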

# References on bank customer churn

https://www.ciokorea.com/news/37389

"Profitable customers leave first": Big banks in crisis amid open banking

딥러닝기반의 금융회사 고객이탈 예측모형에 관한 연구.pdf (attached PDF: A Study on a Deep Learning-Based Customer Churn Prediction Model for Financial Companies)

https://www.hankyung.com/article/2019112919496

"One month of open banking, stop customer churn": Bank apps keep evolving

https://www.ebn.co.kr/news/view/1436631

Banks see warning signs of customer churn: "If interest rates won't work, at least run promotions"
