To understand advanced deep learning, it is essential to begin with the earliest building block, the Perceptron, and then progress naturally toward the Multi-Layer Perceptron (MLP). This journey explains how machines moved from simple linear decision rules to learning complex nonlinear relationships.
This article introduces neural networks practically, starting from the perceptron, then moving into hidden layers, activation functions, forward propagation, training, and modern MLP architectures.
Neural networks were inspired loosely by the human brain. Biological neurons receive signals through dendrites, combine them, and transmit outputs when activation thresholds are reached. Artificial neural networks simplify this concept mathematically.
An artificial neuron receives numeric inputs, multiplies them by weights, adds a bias term, and passes the result through an activation function. Though far simpler than real biology, this abstraction proved extraordinarily powerful.
The Perceptron
The Perceptron is one of the earliest neural models. It takes several inputs, assigns each a weight, sums them, adds a bias, and produces an output score:
z = b + w₁x₁ + w₂x₂ + … + wₙxₙ
That score is then passed through a step function that outputs either 0 or 1 depending on whether the score crosses a threshold.
This makes the perceptron a simple binary classifier.
Suppose inputs represent exam attendance and homework completion. A perceptron may predict pass or fail based on learned importance of those variables.
import numpy as np
x = np.array([1, 1]) # two inputs
w = np.array([0.7, 0.6]) # weights
b = -1.0 # bias
z = np.dot(x, w) + b
y = 1 if z > 0 else 0
print("Output:", y)
Output:
Output: 1
The perceptron predicts class 1 because the weighted sum exceeded zero.
The perceptron can only learn linear decision boundaries. It can separate classes that can be divided by a straight line, plane, or hyperplane. However, many real-world relationships are nonlinear.
A famous example is the XOR logic problem, where a single perceptron fails because no straight line can separate the two classes. The XOR logic gate has outputs:
(0,0) → 0
(1,1) → 0
(0,1) → 1
(1,0) → 1
The two 1s lie on opposite corners, and the two 0s lie on the other opposite corners. No single straight line can place both 1s on one side and both 0s on the other side.
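A quick way to verify this in code is to fit a linear perceptron on the four XOR points. The following is a minimal sketch using scikit-learn's Perceptron class; the exact accuracy it reaches depends on training details, but it can never reach 100%:
from sklearn.linear_model import Perceptron
import numpy as np
# The four XOR input combinations and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
clf = Perceptron(max_iter=1000, random_state=0)
clf.fit(X, y)
# A linear boundary can classify at most 3 of the 4 points correctly
print("Training accuracy:", clf.score(X, y))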
This limitation motivated deeper architectures.
Multi-Layer Perceptron
The solution was to stack neurons into layers. Instead of one neuron directly mapping inputs to outputs, networks gained hidden layers between input and output. This architecture is called the Multi-Layer Perceptron, or MLP.

An MLP contains:
1. Input layer receiving features
2. Hidden layer(s) learning intermediate patterns
3. Output layer producing predictions
With hidden layers, networks can learn nonlinear relationships that single perceptrons cannot.
Each hidden neuron transforms the data into new representations. Early layers may learn simple patterns, while deeper layers learn more abstract combinations.
For customer churn, one hidden neuron may learn complaint intensity, another may capture declining engagement, and another may detect payment instability. Combined, these become stronger predictive signals.
Each hidden neuron learns a simple pattern, then later layers combine them into complex shapes (curves, regions, corners). For the XOR gate, one hidden neuron can detect one region, another detects the opposite region, and the output neuron combines them, as the sketch below illustrates. This allows the network to model nonlinear relationships that a single perceptron cannot.
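To make this concrete, here is a minimal hand-wired sketch of that idea. The weights below are chosen by hand for illustration, not learned, but they show how two ReLU detectors plus one output neuron reproduce XOR:
def relu(z):
    return max(0.0, z)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = relu(x1 + x2 - 0.5)   # fires when at least one input is 1 ("OR-like")
    h2 = relu(x1 + x2 - 1.5)   # fires only when both inputs are 1 ("AND-like")
    z = h1 - 3 * h2 - 0.25     # output neuron: OR but not AND
    y = 1 if z > 0 else 0
    print((x1, x2), "->", y)   # reproduces the XOR truth table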
Activation Functions
If layers only used linear transformations, stacking many layers would still behave like a single linear model. In simple terms, combining multiple straight-line transformations still produces another straight-line transformation. This means that no matter how many layers are added, the network would remain limited to learning only linear decision boundaries and could not capture complex curved relationships found in real-world data.

The power of neural networks comes from introducing nonlinearity, and this is done through activation functions. After each neuron computes a weighted sum of inputs, an activation function transforms that value before passing it to the next layer. Because of these nonlinear transformations, networks can learn intricate patterns such as curves, interactions, hierarchies, and abstract representations.
Common activation functions include Sigmoid, Tanh, and ReLU. Each has different mathematical behavior and practical use cases.
Sigmoid converts any real number into a value between 0 and 1, making it useful for probability outputs in binary classification.
σ(z) = 1 / (1 + e^(-z))
Large positive values move close to 1, while large negative values move close to 0. Historically, sigmoid was widely used in hidden layers, though it is now used more commonly in output layers for binary tasks.
Tanh is similar but outputs values between -1 and 1, often centering data better than sigmoid.
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
ReLU, or Rectified Linear Unit, has become one of the most common modern choices.
ReLU(z) = max(0, z)
If the input is negative, ReLU outputs 0. If the input is positive, it returns the same value. This simple behavior gives several advantages. It is computationally fast, reduces some gradient problems seen in sigmoid networks, and often helps deeper networks train more efficiently.
That is why ReLU and its variants are especially common in modern deep learning systems for image models, language architectures, and many multilayer networks. Without activation functions, deep networks would lose most of their expressive power.
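The following snippet, a small NumPy-only sketch, evaluates all three activations on the same inputs so their output ranges can be compared directly:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", np.round(sigmoid(z), 3))   # values squashed into (0, 1)
print("tanh:   ", np.round(np.tanh(z), 3))   # values squashed into (-1, 1)
print("relu:   ", relu(z))                   # negatives clipped to 0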
Forward Propagation
Prediction in a neural network happens through a process called forward propagation. This is the stage where input data enters the network, passes through each layer, and is gradually transformed into a final prediction. It is called forward propagation because information flows in the forward direction, from the input layer to hidden layers and finally to the output layer.

At each neuron, the model first computes a weighted combination of incoming values. Each input is multiplied by a learned weight, all values are summed together, and a bias term is added. This produces an intermediate score. That score is then passed through an activation function, which introduces nonlinearity and determines the neuron's output signal.
For a hidden layer, this process is commonly written as:
a = f(Wx + b)
Here, x represents the input vector coming from the previous layer, W is the matrix of learned weights, b is the bias vector, and f is the activation function such as ReLU, Sigmoid, or Tanh. The result a is called the activation of that layer.
These activations become the inputs to the next layer, where the same process repeats. Each layer transforms the representation further. Early layers may learn simple patterns, while deeper layers may combine them into more meaningful structures.
For example, in a customer churn model, the first hidden layer may detect spending intensity or complaint frequency, while deeper layers may combine those signals into broader patterns such as disengagement risk.
The final output layer depends on the task. In binary classification, it may return a probability between 0 and 1. In multiclass classification, it may output probabilities across several classes. In regression tasks, it may output a continuous number.
Forward propagation is therefore the prediction engine of the network. Once training is complete and weights are learned, every real-world prediction is produced by repeatedly applying this forward flow of weighted sums and activations.
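As an illustration, the sketch below runs one forward pass through a single hidden layer. The weights here are random placeholders standing in for learned values:
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])        # input vector (3 features)

W1 = rng.normal(size=(4, 3))          # hidden layer: 4 neurons x 3 inputs
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))          # output layer: 1 neuron x 4 hidden units
b2 = np.zeros(1)

h = np.maximum(0, W1 @ x + b1)        # hidden activation: a = ReLU(Wx + b)
z_out = W2 @ h + b2                   # output score
y = 1 / (1 + np.exp(-z_out))          # sigmoid turns the score into a probability
print("Predicted probability:", y[0])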
Training with Backpropagation
Neural networks learn by repeatedly making predictions, measuring mistakes, and adjusting internal parameters to improve future results. This learning process is driven by a comparison between the model's predicted outputs and the actual target values using a loss function. The loss function converts prediction error into a numeric value that tells the network how wrong it is. Lower loss means better performance, while higher loss means the model still needs improvement.

For example, in classification problems the model may use cross-entropy loss, while regression tasks often use mean squared error. These functions provide a clear objective: minimize the error over the training data.
Once the loss is calculated, the network must determine how to improve its weights and biases. This is where backpropagation becomes essential. Backpropagation works by sending error information backward through the network, from the output layer toward earlier hidden layers. It uses calculus and the chain rule to compute how much each weight contributed to the final error.
In practical terms, backpropagation answers questions such as: Which connections increased the mistake? Which weights should be reduced? Which should be strengthened? By measuring each parameter’s influence on the loss, the network knows how to update itself intelligently rather than guessing randomly.
After gradients are computed, an optimization algorithm updates the parameters. The most common method is gradient descent, which moves weights in the direction that reduces loss. More advanced optimizers such as Adam, RMSProp, or Momentum often train faster and more stably.
A simplified update step can be written as:
w = w - η(∂L / ∂w)
Here, w is a weight, η is the learning rate, and the gradient shows how changing that weight affects the loss.
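As a toy illustration of this rule, the sketch below fits a single weight w so that w*x matches a target t, using the squared-error loss L = (w*x - t)^2, whose gradient from the chain rule is ∂L/∂w = 2x(w*x - t):
x, t = 2.0, 6.0        # one training input and its target
w = 0.0                # initial weight
eta = 0.05             # learning rate (η)

for step in range(20):
    pred = w * x
    grad = 2 * x * (pred - t)   # ∂L/∂w from the chain rule
    w = w - eta * grad          # the update step w = w - η(∂L/∂w)

print("Learned weight:", round(w, 3))  # converges toward 3.0, since 3 * 2 = 6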
This process is repeated over many batches and epochs. A batch is a small subset of training data processed at one time, while an epoch means one full pass through the entire training dataset. With each cycle, the network gradually reduces error and learns more useful internal representations.
For example, in an image classifier, early training may produce nearly random guesses. After many epochs of forward propagation, loss calculation, backpropagation, and parameter updates, the model begins recognizing edges, shapes, textures, and eventually full object categories.
Backpropagation is therefore the core learning engine of neural networks. Forward propagation makes predictions, but backpropagation teaches the network how to become better.
MLP Example with sklearn
A practical way to build a neural network in Python is by using scikit-learn, which provides the MLPClassifier for classification problems. MLP stands for Multi-Layer Perceptron, meaning a feedforward neural network with one or more hidden layers. It is an excellent tool for learning neural network fundamentals without needing a deep learning framework such as TensorFlow or PyTorch.

In the example below, we train a small neural network on a simple dataset with two input features and binary target labels. Each row in X represents one observation, while y contains the correct class for each row.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Input features
X = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
    [2, 1],
    [2, 0],
    [0, 2],
    [1, 2]
])
# Target labels
y = np.array([1, 0, 0, 0, 1, 1, 0, 1])
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# Build neural network
model = MLPClassifier(
    hidden_layer_sizes=(5, 3),
    max_iter=2000,
    random_state=42
)
# Train model
model.fit(X_train, y_train)
# Make predictions
pred = model.predict(X_test)
print("Predictions:", pred)
print("Accuracy:", accuracy_score(y_test, pred))
Output:
Predictions: [0 0]
Accuracy: 0.5
This example begins by importing the required libraries. The dataset is then divided into two parts using train_test_split. The training set is used to teach the neural network, while the testing set is reserved for evaluating performance on unseen data. This is important because strong performance on training data alone does not guarantee real predictive ability.
The model is created using:
MLPClassifier(hidden_layer_sizes=(5,3))
This means the neural network has two hidden layers. The first hidden layer contains 5 neurons, and the second hidden layer contains 3 neurons. The network architecture can be visualized as:
Input Layer → Hidden Layer (5 neurons) → Hidden Layer (3 neurons) → Output Layer
During training, the model uses forward propagation to generate predictions and backpropagation to update weights. The parameter max_iter=2000 allows up to 2000 optimization steps so the model has enough opportunity to converge.
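One way to confirm this architecture on the fitted model from the example above is to inspect the learned weight matrices, which scikit-learn stores in the coefs_ attribute:
# One weight matrix per layer transition:
# (2, 5) input → hidden1, (5, 3) hidden1 → hidden2, (3, 1) hidden2 → output
print([c.shape for c in model.coefs_])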
After training, the model predicts classes for the test set. In the sample output:
Predictions: [0 0]
The network assigned both unseen samples to class 0. The accuracy score:
Accuracy: 0.5
means the model predicted one of the two test examples correctly. Because the dataset has only eight rows, the split leaves just two test samples, so accuracy swings heavily on each single prediction. On real-world datasets, reliable results require more data and careful tuning.
This example demonstrates how quickly neural networks can be built using scikit-learn. With only a few lines of code, we create a multi-layer model capable of learning nonlinear decision boundaries. It is an excellent starting point before moving to larger and deeper architectures.
Where MLPs Are Used
MLPs are effective on structured tabular data, customer scoring, churn prediction, fraud detection, forecasting features, recommendation inputs, and many classification tasks. Before transformers and convolutional networks dominated specialized fields, MLPs were foundational deep learning models.

Even today, they remain strong baselines and components inside larger architectures.
Neural networks can learn complex nonlinear relationships, interactions, and hidden patterns automatically. They are flexible and powerful with enough data.
However, they may require more tuning, more computation, and more data than simpler models such as Logistic Regression or tree-based models, and they are harder to interpret. They can also overfit without regularization.
Many beginners use neural networks on tiny datasets where simpler models perform better. Others ignore scaling, choose poor learning rates, stop training too early, or use too many layers unnecessarily.
Good results come from disciplined experimentation, not merely adding complexity.
Conclusion
The journey from the Perceptron to the Multi-Layer Perceptron explains the foundation of modern deep learning. A single perceptron can learn linear rules, while stacked layers with activation functions can model complex nonlinear relationships.

Though today's AI systems are vastly larger, their core logic still echoes this original design: weighted inputs, learned representations, and iterative improvement through optimization. Understanding these basics is the first true step into neural networks.