[Machine Learning Coursera Lecture] (Week 2) "Cost Function & Gradient Descent" Machine Learning (by Andrew Ng)

Marvin 2022. 5. 26. 01:14

This post summarizes the key points of week 2 of Professor Andrew Ng's Machine Learning course on Coursera, and implements the related algorithms directly in Python.

The accompanying ipython code is in a Google Colab link.

Hypothesis: in $h_{\theta} (x) = \theta_0 + \theta_1 x$, we choose $\theta_0, \theta_1$. Specifically, we look for the $\theta_0, \theta_1$ that minimize the cost function $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \Big( h_\theta (x_i )- y_i \Big)^2$.

$$\min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^m \Big( h_\theta (x_i )- y_i \Big)^2$$
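As a quick illustration of this cost function, here is a minimal vectorized sketch with the $\frac{1}{2m}$ scaling that the code in this post uses. The function name `cost` and the tiny arrays `x_demo`, `y_demo` are made up for this example and are not part of the original notebook.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Half mean squared error J(theta0, theta1) for h(x) = theta0 + theta1 * x."""
    residuals = theta0 + theta1 * x - y
    return np.sum(residuals ** 2) / (2 * len(x))

# Tiny noiseless data generated from y = 2 + 3x (illustrative only).
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 + 3.0 * x_demo

cost_at_truth = cost(2.0, 3.0, x_demo, y_demo)  # every residual is zero here
cost_off = cost(2.0, 2.0, x_demo, y_demo)       # residuals equal -x, so the cost is positive
```

With noiseless data the cost is exactly zero at the true parameters and grows as the parameters move away from them, which is what the minimization above exploits.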

To draw a contour plot, let's run a simulation. As the example, we sample $x$ and $y$ from $y = 2 + 3 \times x$ (that is, $\theta_0 = 2, \theta_1 = 3$), with a small amount of Gaussian noise added to $y$.

import numpy as np
import matplotlib.pyplot as plt

mu = 10        # mean of x
sigma = 0.25   # standard deviation of x

theta0 = 2     # true intercept
theta1 = 3     # true slope
n = 10000      # number of samples

# Draw x from N(mu, sigma^2), then build y = theta0 + theta1 * x plus N(0, 0.1^2) noise.
x = np.random.normal(mu, sigma, n)
u = np.random.normal(0, 0.1, n)
y = theta0 + theta1 * x + u

plt.scatter(x, y)

model = np.polyfit(x, y, 1)
model

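This outputs array([3.00186546, 1.98162399]). np.polyfit returns the highest-degree coefficient first, so these are the estimates of $\theta_1$ and $\theta_0$, and they match the intended parameter values. Next, we define the cost function, evaluate it over a range of $\theta_0$ and $\theta_1$, and plot the result.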

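def J(b, a):
  # Cost for intercept b and slope a; b and a may be scalars or meshgrid arrays.
  total = 0  # avoid shadowing the built-in sum
  for s in range(n):
    total = total + (b + x[s]*a - y[s])**2
  return total / (2*n)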
  
pointsTheta0 = np.arange(1.95,2.05,0.01)
pointsTheta1 = np.arange(2.95,3.05,0.01)
axisTheta0, axisTheta1 = np.meshgrid(pointsTheta0, pointsTheta1)
Z = J(axisTheta0,axisTheta1)

fig = plt.figure(figsize = (12,10))
ax = plt.axes(projection='3d')

surf = ax.plot_surface(axisTheta0, axisTheta1, Z, cmap = plt.cm.cividis)

# Set axes label
ax.set_xlabel('theta0', labelpad=20)
ax.set_ylabel('theta1', labelpad=20)
ax.set_zlabel('J(theta0,theta1)', labelpad=20)

fig.colorbar(surf, shrink=0.5, aspect=8)

plt.show()

This produces a surface plot of the cost function over $(\theta_0, \theta_1)$. By eye, the cost barely changes along $\theta_0$, while the change along $\theta_1$ is visible.
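One way to make sense of this difference is to compare the curvature of $J$ along each axis. For $J = \frac{1}{2m}\sum_i (\theta_0 + \theta_1 x_i - y_i)^2$, the second derivative with respect to $\theta_0$ is 1, and with respect to $\theta_1$ it is the mean of $x_i^2$. The sketch below assumes the same distribution for $x$ as the simulation; the seed and the names `x_sim`, `curv_theta0`, `curv_theta1` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen only for reproducibility
x_sim = rng.normal(10.0, 0.25, 10_000)    # same mu and sigma as the simulation

# Second derivatives of J = 1/(2m) * sum((theta0 + theta1*x - y)**2):
curv_theta0 = 1.0                   # d2J / dtheta0^2
curv_theta1 = np.mean(x_sim ** 2)   # d2J / dtheta1^2, roughly mu**2 + sigma**2
ratio = curv_theta1 / curv_theta0
```

Because $x$ is centered near 10, the curvature along $\theta_1$ is about two orders of magnitude larger, so an equal step along $\theta_1$ changes the cost far more than the same step along $\theta_0$.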

Next, we draw the contour plot.

v = np.linspace(0.004, 0.02, 10, endpoint=True)
# plt.contour(axisTheta0, axisTheta1, Z)
plt.contourf(axisTheta0, axisTheta1, Z, v, cmap=plt.cm.jet)
plt.colorbar();

The code above produces a filled contour plot. Judging by eye, the cost function comes out similar for $\theta_0$ between 1.97 and 2.03, and nearly identical across the $\theta_1$ values.
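Reading values off a contour plot by eye is imprecise, so it can help to look up the lowest-cost grid cell directly. The following is a sketch only: it regenerates a smaller synthetic data set under the same assumptions ($y = 2 + 3x$ plus noise), evaluates the cost on a grid with NumPy broadcasting instead of a Python loop, and uses `np.argmin` to find the minimum. All variable names here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_sim = rng.normal(10.0, 0.25, 2_000)
y_sim = 2.0 + 3.0 * x_sim + rng.normal(0.0, 0.1, 2_000)

theta0_grid = np.linspace(1.95, 2.05, 11)
theta1_grid = np.linspace(2.95, 3.05, 11)
T0, T1 = np.meshgrid(theta0_grid, theta1_grid)

# Broadcast the (11, 11) grids against the data along a trailing axis.
residuals = T0[..., None] + T1[..., None] * x_sim - y_sim
Z_grid = np.mean(residuals ** 2, axis=-1) / 2   # cost at every grid point

# Locate the grid cell with the lowest cost.
row, col = np.unravel_index(np.argmin(Z_grid), Z_grid.shape)
best_theta0, best_theta1 = T0[row, col], T1[row, col]
```

`np.linspace` is used here rather than `np.arange` so that the number of grid points does not depend on floating-point rounding at the end of the range.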

Next, we use the gradient descent algorithm to find the $\theta$ values. The update below is repeated until the $\theta$ values converge.

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

for $j=0$ and $j=1$.

The $\theta$ values are updated simultaneously.
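To show what "simultaneously" means in practice, here is a small sketch on a toy cost chosen only for illustration, $J = \frac{1}{2}(\theta_0 + \theta_1 - 1)^2$. The helper `simultaneous_step` and the toy gradients are hypothetical names, not part of the original notebook.

```python
def simultaneous_step(theta0, theta1, grad0, grad1, alpha):
    """One gradient descent step where both partials use the OLD parameters."""
    g0 = grad0(theta0, theta1)   # evaluated before any parameter changes
    g1 = grad1(theta0, theta1)
    return theta0 - alpha * g0, theta1 - alpha * g1

# Toy cost J = (theta0 + theta1 - 1)**2 / 2; both partials are theta0 + theta1 - 1.
def toy_grad0(t0, t1):
    return t0 + t1 - 1.0

def toy_grad1(t0, t1):
    return t0 + t1 - 1.0

new_theta0, new_theta1 = simultaneous_step(0.0, 0.0, toy_grad0, toy_grad1, alpha=0.5)

# A sequential (non-simultaneous) update would reuse the already-updated theta0:
seq_theta0 = 0.0 - 0.5 * toy_grad0(0.0, 0.0)
seq_theta1 = 0.0 - 0.5 * toy_grad1(seq_theta0, 0.0)
```

The two variants give different values for the second parameter, which is why the loop later in this post stores the new values in temporary variables before assigning them.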

xmean=x.mean(dtype=float)
ymean=y.mean(dtype=float)
xstd=x.std(dtype=float)
ystd=y.std(dtype=float)
x.mean(),x.std(),y.mean(),y.std()

This gives (x mean, x standard deviation, y mean, y standard deviation) = (10.00088584317288, 0.248813713074603, 32.000859006885335, 0.7533594924594834).

We first normalize x and y. If x and y are on different scales, it is hard to choose a single error threshold when searching for $\theta_0$ and $\theta_1$ at the same time.

x=(x-x.mean())/x.std()
y=(y-y.mean())/y.std()

After normalization, x and y each have mean 0 and standard deviation 1.
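A short check of that claim, as a sketch: the helper `standardize` and the sample `x_raw` are made up for this example, and the data is drawn under the same assumptions as the simulation.

```python
import numpy as np

def standardize(values):
    """Return (z, mean, std) with z = (values - mean) / std."""
    mean = values.mean()
    std = values.std()      # population std (ddof=0), NumPy's default
    return (values - mean) / std, mean, std

rng = np.random.default_rng(0)
x_raw = rng.normal(10.0, 0.25, 1_000)
x_norm, x_mean, x_std = standardize(x_raw)

norm_mean = x_norm.mean()   # approximately 0, up to floating-point error
norm_std = x_norm.std()     # approximately 1, up to floating-point error
```

Returning the original mean and standard deviation alongside the normalized values matters because they are needed later to map the fitted parameters back to the original scale.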

Next, to apply the gradient descent algorithm, we need the update term $-\alpha \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1)$, which uses the following partial derivatives.

$j=0: \frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1)= \frac{1}{m} \sum_{i=1}^m \Big( h_\theta (x_i )- y_i \Big) $

$j=1: \frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1)= \frac{1}{m} \sum_{i=1}^m \Big( h_\theta (x_i )- y_i \Big) x_i$
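These two formulas can be sanity-checked numerically. The sketch below compares them against central finite differences of the cost on synthetic data; the names `cost`, `gradient`, `x_chk`, `y_chk` and the evaluation point (4, 4) are illustrative choices.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    return np.mean((theta0 + theta1 * x - y) ** 2) / 2

def gradient(theta0, theta1, x, y):
    """Analytic partial derivatives of the cost above."""
    residuals = theta0 + theta1 * x - y
    return np.mean(residuals), np.mean(residuals * x)

rng = np.random.default_rng(0)
x_chk = rng.normal(0.0, 1.0, 500)
y_chk = 0.5 * x_chk + rng.normal(0.0, 0.1, 500)

# Central finite differences as an independent check of the formulas.
eps = 1e-6
g0, g1 = gradient(4.0, 4.0, x_chk, y_chk)
fd0 = (cost(4.0 + eps, 4.0, x_chk, y_chk) - cost(4.0 - eps, 4.0, x_chk, y_chk)) / (2 * eps)
fd1 = (cost(4.0, 4.0 + eps, x_chk, y_chk) - cost(4.0, 4.0 - eps, x_chk, y_chk)) / (2 * eps)
```

Since the cost is quadratic in the parameters, the central difference should agree with the analytic gradient up to rounding error.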

def update0(b,a) :
  sum = 0
  for s in range(n):
    sum = sum + (b+x[s]*a-y[s])
  return sum *1/(n)

def update1(b,a) :
  sum = 0
  for s in range(n):
    sum = sum + (b+x[s]*a-y[s])*x[s]
  return sum *1/(n)

Next, using the functions above, we apply the gradient descent algorithm to find the $\theta$ values.

predTheta0=4
predTheta1=4
prevTheta0=0
prevTheta1=0
im_predTheta0=1
im_predTheta1=1
alpha= 10**(-3)
error = 10**(-5)

while (abs(predTheta0-prevTheta0)>error or abs(predTheta1-prevTheta1)>error):
  prevTheta0 = predTheta0
  prevTheta1 = predTheta1
  im_predTheta0=predTheta0-alpha*update0(predTheta0,predTheta1)
  im_predTheta1=predTheta1-alpha*update1(predTheta0,predTheta1)
  predTheta0=im_predTheta0
  predTheta1=im_predTheta1

True: $y = \theta_1 x + \theta_0$

$$y' = a x' + b$$

where $y' = \frac{y-\mu_y}{\sigma_y}$ and $x'=\frac{x-\mu_x}{\sigma_x}$ (the true values are $y$ and $x$, while $y'$ and $x'$ are their normalized versions. In the Python code, $a$ and $b$ are predTheta1 and predTheta0, respectively).

Then,

$$ \frac{y-\mu_y}{\sigma_y} = a \times \Big( \frac{x-\mu_x}{\sigma_x} \Big) + b$$

Therefore,

$$ y = a \frac{\sigma_y}{\sigma_x} x + \Big( \mu_y - a \frac{\sigma_y}{\sigma_x} \mu_x + \sigma_y b \Big) $$

Thus, $\theta_1 = a \frac{\sigma_y}{\sigma_x} $ and $\theta_0 = \mu_y - a \frac{\sigma_y}{\sigma_x} \mu_x + \sigma_y b$. Applying this gives:

estTheta1=predTheta1*ystd/xstd
estTheta0=ymean-predTheta1*(ystd/xstd)*xmean+ystd*predTheta0

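This gives (estTheta0, estTheta1) = (1.7672025480851359, 3.0238499860834755), which is fairly close to the true values. How to control the error needs more thought (using a smaller error would probably give better values). One relevant observation: the loop stops as soon as each step, alpha times the gradient, falls below error, so with alpha = 10**(-3) and error = 10**(-5) the gradient can still be about $10^{-2}$ when the loop exits.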
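To explore how the stopping threshold affects the result, here is a vectorized sketch of the same procedure run with two settings and compared against `np.polyfit`. It is not the notebook's code: the data is regenerated with a fixed seed and fewer samples, the helpers `fit_normalized` and `to_original_scale` are hypothetical, and the second setting (alpha = 1e-1, tol = 1e-9) is an arbitrary choice for comparison.

```python
import numpy as np

def fit_normalized(x_n, y_n, alpha, tol, start=(4.0, 4.0), max_iter=1_000_000):
    """Gradient descent on standardized data; stops when both steps are <= tol."""
    b, a = start                              # b: intercept, a: slope
    for _ in range(max_iter):
        residuals = b + a * x_n - y_n
        step_b = alpha * residuals.mean()
        step_a = alpha * (residuals * x_n).mean()
        b, a = b - step_b, a - step_a         # simultaneous update
        if abs(step_b) <= tol and abs(step_a) <= tol:
            break
    return b, a

rng = np.random.default_rng(0)
x_raw = rng.normal(10.0, 0.25, 2_000)
y_raw = 2.0 + 3.0 * x_raw + rng.normal(0.0, 0.1, 2_000)
x_n = (x_raw - x_raw.mean()) / x_raw.std()
y_n = (y_raw - y_raw.mean()) / y_raw.std()

def to_original_scale(b, a):
    """Map (b, a) fitted on standardized data back to (theta0, theta1)."""
    slope = a * y_raw.std() / x_raw.std()
    intercept = y_raw.mean() - slope * x_raw.mean() + y_raw.std() * b
    return intercept, slope

loose = to_original_scale(*fit_normalized(x_n, y_n, alpha=1e-3, tol=1e-5))
tight = to_original_scale(*fit_normalized(x_n, y_n, alpha=1e-1, tol=1e-9))
reference = np.polyfit(x_raw, y_raw, 1)       # [slope, intercept]
```

In this setup the tighter run should land much closer to the least-squares reference than the looser one, which supports the guess that the remaining gap comes from the stopping rule rather than from the back-transformation formula.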
