Neural Network (1)---Logistic Regression

After a long time, I have decided to study neural networks in a systematic way.

Introduction

When I first talk about this topic, I want to explain the algorithm, logistic regression, in a mathematical way. First, let me introduce it: it is a tool for solving binary classification problems. For a binary classification problem you may recall the function \(y=wx+b\), that is, linear regression, which gives the simplest possible classifier: once you have \(w, b\) and you input an \(x\), you get a \(\hat y\), and depending on whether the actual value of \(y\) corresponding to that \(x\) is bigger or smaller than \(\hat y\), you can split the data into two categories.
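As a toy illustration of this thresholding idea (the points, \(w\), and \(b\) below are made up for illustration and have nothing to do with the dataset used later), a point is labeled 1 if it lies above the line \(y=wx+b\) and 0 otherwise:

import numpy as np

points = np.array([[0.0, 2.5], [1.0, 1.0], [2.0, 5.5], [3.0, 2.0]])  # each row is a point (x, y)
w, b = 1.0, 1.0                                                      # the separating line y = wx + b

y_hat = w * points[:, 0] + b                    # value of the line at each point's x
labels = (points[:, 1] > y_hat).astype(int)     # 1 if the point lies above the line, else 0
print(labels)                                   # [1 0 1 0]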

Overview of logistic regression

Today we will introduce a new algorithm to solve this problem: \[ \left\{\begin{array}{l} z=w^{\top} x+b \\ \hat{y}=a=\sigma(z) \\ L(a, y)=-(y \log (a)+(1-y) \log (1-a)) \end{array}\right. \]

You give the \(x\) and \(y\), you provide an initialization of \(w, b\), and the program computes \(\hat y\) and the loss function \(L(a,y)\). The function \(\sigma(z)\) is \[ \sigma(z)=\frac{e^z}{e^z+1} \]
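Two standard identities are worth writing down here, because the derivative is exactly the factor \(\frac{e^z}{(e^z+1)^2}=a(1-a)\) that appears in the derivation below: \[ \sigma(z)=\frac{e^z}{e^z+1}=\frac{1}{1+e^{-z}},\qquad \sigma'(z)=\frac{e^z}{(e^z+1)^2}=\sigma(z)\bigl(1-\sigma(z)\bigr) \]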

\(L\) plays the same role as \(\frac{1}{2}(y-\hat y)^2\): it measures how close two values are. If \(L\) is very small, the prediction is very close to the label. So why don't we choose \(\frac{1}{2}(y-\hat y)^2\)? Because with that choice the loss function may not be convex, which means it can have many local minima, just like this picture:

When the loss function is non-convex, gradient descent cannot work very well, so we choose the other function as the loss function. So why can \(L(a,y)\) be used to describe this distance? Just look at these two pictures:

The two pictures show \(L(a,y)\) when I assume \(y=0\) and \(y=1\); in each case the value of \(L\) depends only on the value of \(a\).

With a different \(a\), \(L\) takes a different value. For example, when \(y=0\), \(L\) is zero only at \(a=0\), which is exactly the point where the predicted probability of the correct class equals 1, and \(L\) blows up as \(a\) approaches 1.

Here 1 represents True and 0 represents False, and this behavior is the reason why we choose this function as our loss function.
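Written out, the two cases behind those pictures are \[ L(a,1)=-\log(a),\qquad L(a,0)=-\log(1-a), \] so for \(y=1\) the loss goes to 0 as \(a\to 1\) and to infinity as \(a\to 0\), and symmetrically for \(y=0\).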

But unfortunately this value only measures the error on a single example, so it cannot represent the overall performance of the model. We need another function to reflect the accuracy of the model, so we introduce \(J\), the "cost function". Since \(L\) already carries the minus sign, its expression is: \[ J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(a^{(i)},y^{(i)}) \]
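As a small worked example (with made-up numbers): take \(m=2\), predictions \(a^{(1)}=0.9,\ a^{(2)}=0.2\) and labels \(y^{(1)}=1,\ y^{(2)}=0\); then \[ J=\frac{1}{2}\bigl(-\log(0.9)-\log(0.8)\bigr)\approx\frac{0.105+0.223}{2}\approx 0.16 \] (using natural logarithms).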

To train the model we need gradients, so let us differentiate the loss. Writing \(dz\) for \(\frac{\partial L}{\partial z}\) and \(da\) for \(\frac{\partial L}{\partial a}\), the chain rule gives: \[ dz=\frac{\partial L}{\partial z}=\frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z},\qquad da=\frac{\partial L}{\partial a}=-\frac{y}{a}+\frac{1-y}{1-a} \]

so we can get: \[ \begin{aligned}dz &=\frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\\&=\left(-\frac{y}{a}+\frac{1-y}{1-a}\right)\cdot \frac{e^z}{(e^z+1)^2}\\&=a(1-a)\left(-\frac{y}{a}+\frac{1-y}{1-a}\right)\\&=a-y\end{aligned} \] OK, we have \(dz\); then we can get some other important quantities like these: \[ dw_1=\frac{\partial L}{\partial w_1}=x_1\,dz \qquad db=dz \] Then we can use gradient descent to update \(w, b\): \[ \begin{cases}w=w-\alpha\, dw\\b=b-\alpha\, db\end{cases} \] In these expressions \(a^{(i)}=\hat y^{(i)}=\sigma(z^{(i)})=\sigma(w^{\top}x^{(i)}+b)\), and we can calculate the derivative of \(J\) with respect to \(w\) and \(b\):

\[ \frac{\partial}{\partial w_j}J(w,b)=\frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_j}L(a^{(i)},y^{(i)}) \] and, from the per-example result above, \[ \frac{\partial}{\partial w_j}L(a^{(i)},y^{(i)})=x_j^{(i)}\,dz^{(i)}=x_j^{(i)}\bigl(a^{(i)}-y^{(i)}\bigr), \] so one pass of forward and backward propagation over the whole training set looks like this:

J=0; dw_1=0; dw_2=0; db=0
For i=1 to m:                               # loop over the m training examples
    z_i = w_T*x_i + b                       # w_T is the transpose of w
    a_i = sigmoid(z_i)
    J += -[y_i*log(a_i) + (1-y_i)*log(1-a_i)]
    dz_i = a_i - y_i
    dw_1 += x_1i*dz_i                       # x_1i, x_2i: the two features of example i
    dw_2 += x_2i*dz_i                       # (here we assume only two features)
    db += dz_i
J /= m
dw_1 /= m; dw_2 /= m; db /= m
w_1 = w_1 - learning_rate*dw_1; w_2 = w_2 - learning_rate*dw_2; b = b - learning_rate*db

Vectorization

Whenever possible, avoid explicit for loops. As an example, take the matrix-vector product \(u=Av\), where \(A\) is \(n\times m\) and each entry is \(u_i=\sum_j A_{ij}v_j\). With explicit for loops:

u = np.zeros((n, 1))
for i in range(n):
    for j in range(m):
        u[i] += A[i][j] * V[j]

If we use python-numpy instead:

import numpy as np
u = np.dot(A, V)    # the whole matrix-vector product in one call

python-numpy's syntax is simpler and the calculation is much more efficient. That matters because deep learning needs a lot of computation, so the matrix syntax provided by numpy is recommended.
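A quick way to see the speed difference on your own machine (the sizes below are just an illustrative setup, not a benchmark from this post):

import time
import numpy as np

n, m = 1000, 1000
A = np.random.rand(n, m)
V = np.random.rand(m, 1)

u_loop = np.zeros((n, 1))           # explicit loops
t0 = time.time()
for i in range(n):
    for j in range(m):
        u_loop[i] += A[i][j] * V[j]
t_loop = time.time() - t0

t0 = time.time()                    # vectorized
u_vec = np.dot(A, V)
t_vec = time.time() - t0

print("loop:", t_loop, "s   vectorized:", t_vec, "s")
print("same result:", np.allclose(u_loop, u_vec))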

Coding

Let's build this model (a minimal neural network) step by step, starting from the structure:

  • read the data and define the sigmoid function
  • use the data and the sigmoid function to propagate forward and compute \(J\)
  • backpropagate to get \(dw\) and \(db\)
  • set the number of iterations and the learning rate, and update \(w, b\) so that \(J\) gets smaller
  • predict on the training data and the test data, and get the accuracy of the model's predictions
  • plot \(J\) against the number of iterations to see whether the model overfits (a plotting sketch is given at the end of this post)

import libraries and read data

import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset
%matplotlib inline

train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()
print("y_train shape:", train_set_y.shape)

index = 2
plt.imshow(train_set_x_orig[index])            # show one training image
print("y = " + str(train_set_y[:, index]))     # and its label

m_train = train_set_x_orig.shape[0]            # number of training examples
m_test = test_set_x_orig.shape[0]              # number of test examples
num_px = train_set_x_orig.shape[1]             # height/width of each square image
print("m_train:", m_train, "m_test:", m_test, "num_px:", num_px)

# print(train_set_x_orig[0].shape)
# flatten each image into a column of length num_px*num_px*3, so X has shape (num_px*num_px*3, m)
train_set_x_flatten = train_set_x_orig.reshape(m_train, -1).T
# print(train_set_x_flatten.shape)
test_set_x_flatten = test_set_x_orig.reshape(m_test, -1).T

data normalization

train_set_x = train_set_x_flatten/255    # pixel values lie in [0, 255], so this scales them to [0, 1]
test_set_x = test_set_x_flatten/255

define sigmoid

def sigmoid(Z):                     # define the sigmoid function, applied elementwise
    s = 1/(1+np.exp(-Z))
    return s
print(sigmoid(np.array([1,2])))     # test the sigmoid function
def initialize_with_zeros(dim):     # w is a column vector of zeros, b is the scalar 0
    w = np.zeros((dim,1))
    b = 0
    assert(w.shape == (dim,1))
    assert(isinstance(b,float) or isinstance(b,int))
    return w,b

propagation

def Propagation(w,b,X,Y):
    m = X.shape[1]                  # number of examples (each column of X is one example)
    Z = np.dot(w.T,X)+b             # forward propagation
    A = sigmoid(Z)                  # A has shape (1, m): one activation per example
    J = 1/m*np.sum(-(Y*np.log(A)+(1-Y)*np.log(1-A)))
    dz = A-Y                        # backward propagation
    dw = (1/m)*np.dot(X,dz.T)
    db = (1/m)*np.sum(dz)
    grads = {"dw":dw,
             "db":db}
    return grads,J

w,b,X,Y = np.array([[1],[2]]), 2., np.array([[1,2,-1],[3,4,-3.2]]), np.array([[1,0,1]])
grads,J = Propagation(w,b,X,Y)
print("dw =",str(grads['dw']))
print("db =",str(grads['db']))
print('J =',str(J))

define optimize

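The block that originally appeared here was a second copy of Propagation pasted by mistake, so below is a minimal sketch of the missing optimize function, reconstructed from how model() calls it further down (it must return the parameters w and b, the final gradients, and a list of costs); recording the cost every 100 iterations is my own assumption:

def optimize(w,b,X,Y,num_iterations,learning_rate,print_cost=False):
    costs = []
    for i in range(num_iterations):
        grads,J = Propagation(w,b,X,Y)       # one forward/backward pass over the data
        dw = grads["dw"]
        db = grads["db"]
        w = w - learning_rate*dw             # gradient-descent update
        b = b - learning_rate*db
        if i % 100 == 0:
            costs.append(J)
            if print_cost:
                print("cost after iteration %i: %f" % (i,J))
    parameters = {"w":w, "b":b}
    grads = {"dw":dw, "db":db}
    return parameters,grads,costs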

define predict

def predict(w,b,X):
    m = X.shape[1]
    Y_predict = np.zeros((1,m))
    A = sigmoid(np.dot(w.T,X)+b)
    for i in range(A.shape[1]):
        if A[0,i] > 0.5:            # threshold the activation at 0.5
            Y_predict[0,i] = 1
    return Y_predict
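As a quick sanity check we can reuse the toy \(w, b, X\) from the Propagation test above: there \(z=w^{\top}X+b=[9,\ 12,\ -5.4]\), so the first two activations are above 0.5 and the third is below.

print(predict(w,b,X))    # expected: [[1. 1. 0.]]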

model

def model(X_train,Y_train,X_test,Y_test,num_iterations=2000,learning_rate=0.5,print_cost=False):
    w,b = initialize_with_zeros(X_train.shape[0])
    parameters,grads,costs = optimize(w,b,X_train,Y_train,num_iterations,learning_rate,print_cost)
    w = parameters['w']
    b = parameters['b']
    Y_predict_train = predict(w,b,X_train)
    Y_predict_test = predict(w,b,X_test)
    print("test accuracy: {}".format(100-np.mean(np.abs(Y_predict_test-Y_test))*100))
    print("train accuracy: {}".format(100-np.mean(np.abs(Y_predict_train-Y_train))*100))
    d = {"costs":costs,
         "Y_predict_test":Y_predict_test,
         "Y_predict_train":Y_predict_train,
         "w":w,
         "b":b,
         "learning_rate":learning_rate,
         "num_iterations":num_iterations}
    return d
#test the model
d = model(train_set_x,train_set_y,test_set_x,test_set_y,num_iterations=2000,learning_rate=0.005,print_cost=True)
print("test predictions:", d["Y_predict_test"])