[ML] MNIST_Handwritten Digit with Cross-Entropy and Matrix Form

2025. 8. 12. 13:04 | python/ML

In the previous post, we trained an MNIST network with the quadratic (MSE) cost and a per-sample backprop loop.
In this post, we switch to the Cross-Entropy (CE) cost and a mini-batch (vectorized) update.

Cross-entropy cost function

C = −(1/n) Σ_x Σ_j [ y_j ln a_j(L) + (1 − y_j) ln(1 − a_j(L)) ]

With sigmoid + CE (binary) or softmax + CE (multiclass), the output error is

δ(L) = a(L) − y

There is no extra σ′(z) factor multiplying the error at the output layer.
This avoids the severe gradient shrinkage you often get with MSE + sigmoid, where
δ(L) = (a(L) − y) ⊙ σ′(z(L)) can be near zero when the neuron saturates.
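A quick numerical illustration of that shrinkage (a standalone sketch with a made-up pre-activation value, not code from this post): for a saturated sigmoid neuron that is badly wrong, the CE error stays near 1 while the MSE error is scaled down by σ′(z) to almost nothing.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([8.0])                        # hypothetical saturated pre-activation
a, y = sigmoid(z), np.array([0.0])         # prediction ~0.9997, target 0 -> badly wrong

delta_ce  = a - y                          # CE output error  : ~1.0
delta_mse = (a - y) * sigmoid_prime(z)     # MSE output error : ~3e-4, learning stalls
print(delta_ce, delta_mse)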

 

 

Gradients (Mini-Batch, Vectorized)

Setup
Let the mini-batch size be m samples.
For each layer l:

  • A(l) ∈ R^(n_l × m) : activations
  • Z(l) ∈ R^(n_l × m) : pre-activations
  • W(l) ∈ R^(n_l × n_(l−1)) : weights
  • b(l) ∈ R^(n_l × 1) : biases

1. Forward pass 

For l = 1 to L: 

Z[l] = W[l] @ A[l-1] + b[l]        # b[l] is broadcast over columns
A[l] = sigmoid(Z[l])               # for hidden layers
A[L] = softmax(Z[L])                # for output layer with multiclass CE
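As a quick sanity check on these shapes (a standalone sketch with made-up layer sizes and random data, not part of the network below), one vectorized forward pass over a mini-batch looks like this:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m, sizes = 32, [784, 30, 10]                      # hypothetical batch size and layer sizes
W = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
b = [np.random.randn(y, 1) for y in sizes[1:]]

A = np.random.randn(sizes[0], m)                  # A(0) = X, shape (784, 32)
for Wl, bl in zip(W, b):
    Z = Wl @ A + bl                               # (n_l, 1) bias broadcasts over the m columns
    A = sigmoid(Z)
print(A.shape)                                    # (10, 32): one output column per sample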

 

2. Output error

For sigmoid + CE (binary) or softmax + CE (multiclass): 

delta[L] = A[L] - Y
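The implementation below keeps sigmoid outputs with a per-neuron CE, but the same δ(L) = A(L) − Y also holds for softmax + multiclass CE. Here is a small standalone check of that claim against a finite-difference gradient (the softmax and ce helpers exist only for this sketch):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))    # shift logits for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def ce(z, y):
    return -np.sum(y * np.log(softmax(z)))          # multiclass cross-entropy for one sample

z = np.random.randn(4, 1)                           # hypothetical 4-class logits
y = np.zeros((4, 1)); y[2] = 1.0                    # one-hot target, true class = 2

analytic = softmax(z) - y                           # claimed delta[L] = A[L] - Y
numeric, eps = np.zeros_like(z), 1e-6
for i in range(z.size):                             # finite-difference dC/dz_i
    zp, zm = z.copy(), z.copy()
    zp[i] += eps; zm[i] -= eps
    numeric[i] = (ce(zp, y) - ce(zm, y)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))    # True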

 

3. Hidden Layer errors 

For l = L - 1 down to 1 : 

delta[l] = (W[l+1].T @ delta[l+1]) * sigmoid_prime(Z[l])

 

4. Gradients (Averaged over Mini-Batch) 

For each layer l: 

grad_W[l] = (1/m) * delta[l] @ A[l-1].T
grad_b[l] = (1/m) * np.sum(delta[l], axis=1, keepdims=True)
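The single matrix product delta @ A.T already sums the per-sample outer products over the batch, so dividing by m gives the averaged gradient. A small standalone check of that equivalence (made-up sizes):

import numpy as np

m, n_out, n_in = 8, 3, 5                            # hypothetical batch and layer sizes
delta  = np.random.randn(n_out, m)
A_prev = np.random.randn(n_in, m)

batched = (delta @ A_prev.T) / m                    # one matrix product for the whole batch
looped  = sum(np.outer(delta[:, k], A_prev[:, k]) for k in range(m)) / m
print(np.allclose(batched, looped))                 # True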

 

5. Parameter Update 

With learning rate eta: 

W[l] = W[l] - eta * grad_W[l]
b[l] = b[l] - eta * grad_b[l]

 

 

All code : 

#%%
import numpy as np 
import random 

def sigmoid(z):
    return 1 / (1 + np.exp(-z)) 

def sigmoid_prime(z):
    s = sigmoid(z) 
    return s * (1 - s) 


class Network2():
    def __init__(self,sizes): 
        self.sizes = sizes
        self.num_layers = len(sizes) 
        self.biases = [np.random.randn(y,1) for y in sizes[1:]] 
        self.weights = [np.random.randn(y,x) for x,y in zip(sizes[:-1],sizes[1:])]
        
    def feedforward(self, mini_batch): 
        # stack the batch list into one matrix: one column per sample
        X = np.hstack([x for x, _ in mini_batch])  # shape of X : (features, mini_batch)
        
        A = X
        for W, b in zip(self.weights, self.biases):
            z = W @ A + b
            A = sigmoid(z)
        return A
    
    def SGD(self, training_data, mini_batch_size, epochs, eta, test_data = None):
        if test_data: n_test = len(test_data) 
        n = len(training_data) 
        for epoch in range(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size] for k in range(0,n,mini_batch_size)] 
            
            for mini_batch in mini_batches: 
                self.update_mini_batch(mini_batch, eta, "ce") 
            
            if test_data:
                print(f"Epoch {epoch} : {self.evaluate(test_data,mini_batch_size)} | {n_test}")
            else:
                print(f"Epoch {epoch} complete.")
                
    def update_mini_batch(self, mini_batch, eta = 0.1, loss = 'ce'):
        X = np.hstack([x for x,_ in mini_batch]) 
        Y = np.hstack([y for _,y in mini_batch])
        m = X.shape[1]
        
        # Forward pass 
        As = [X] 
        Zs = [] 
        A = X 
        for W,b in zip(self.weights, self.biases):
            Z = W @ A + b 
            A = sigmoid(Z)  # shape: (layer-l neurons, m)
            Zs.append(Z) 
            As.append(A) 
        
        if loss == "ce":
            delta = As[-1] - Y  # shape: (output neurons, m); no sigmoid_prime factor
        elif loss == "mse":
            delta = (As[-1] - Y) * sigmoid_prime(Zs[-1])
        else:
            raise ValueError("loss must be 'ce' (cross-entropy) or 'mse'") 
        
        # Gradients for last layer 
        nabla_b_L = np.sum(delta, axis = 1, keepdims = True) / m 
        nabla_w_L = (delta @ As[-2].T) / m 
        
        nabla_bs = [None] * len(self.biases) 
        nabla_ws = [None] * len(self.weights) 
        nabla_bs[-1] = nabla_b_L 
        nabla_ws[-1] = nabla_w_L
        
        for l in range(2, self.num_layers): 
            Z = Zs[-l] 
            sp = sigmoid_prime(Z) 
            delta = (self.weights[-l + 1].T @ delta) * sp # shape : (l-1 layer neuron, m)
            nabla_b = np.sum(delta, axis=1, keepdims = True) / m 
            nabla_w = (delta @ As[-l-1].T) / m 
            
            nabla_bs[-l] = nabla_b 
            nabla_ws[-l] = nabla_w 
            
        for i in range(len(self.weights)):
            self.weights[i] -= eta * nabla_ws[i] 
            self.biases[i] -= eta * nabla_bs[i]
            
    def evaluate(self, test_data, mini_batch_size = 128):
        n = len(test_data)
        correct = 0 
        for k in range(0, n , mini_batch_size):
            batch = test_data[k : k + mini_batch_size] 
            Y_hat = self.feedforward(batch) 
            Y_true = np.hstack([y for _, y in batch]) 
            pred = np.argmax(Y_hat, axis = 0) 
            true = np.argmax(Y_true, axis= 0) 
            correct += np.sum(pred == true)
        return correct
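The post doesn't show how Network2 is called, so here is a minimal smoke test on synthetic data (random "images" and one-hot labels, a hypothetical stand-in for real MNIST). With the real dataset you would pass a list of (x, y) pairs instead, with x of shape (784, 1) and y a one-hot vector of shape (10, 1):

def make_fake_data(n, n_in=784, n_out=10):
    # hypothetical stand-in for MNIST: random inputs, random one-hot labels
    data = []
    for _ in range(n):
        x = np.random.randn(n_in, 1)
        y = np.zeros((n_out, 1)); y[np.random.randint(n_out)] = 1.0
        data.append((x, y))
    return data

training_data = make_fake_data(512)
test_data = make_fake_data(128)

net = Network2([784, 30, 10])
net.SGD(training_data, mini_batch_size=32, epochs=3, eta=0.5, test_data=test_data)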

 

Source code : http://neuralnetworksanddeeplearning.com/chap3.html

 

