How to implement neural networks with tf.keras in Python

Before you even touch a line of code, you need to get your environment set up right. This isn’t glamorous, but it’s the foundation for everything else. If your tools are a mess, your code will be too, and debugging will become a horror story.

Start with Python 3.8 or later, and don’t settle for anything older. It’s not just about syntax sugar; newer versions bring performance improvements, and recent releases of most libraries no longer support older interpreters. Next, use venv or virtualenv to isolate your project dependencies. Nothing kills productivity faster than dependency conflicts.

Once your virtual environment is ready, install the essentials. For machine learning, that usually means numpy, pandas, and scikit-learn. If you plan on doing anything neural network-related, add TensorFlow or PyTorch to the mix. But don’t blindly install everything; pick what you actually need.

python3 -m venv ml-env
source ml-env/bin/activate
pip install numpy pandas scikit-learn

Why these libraries? NumPy handles arrays and math like a champ, pandas is your go-to for data manipulation, and scikit-learn offers a consistent API for classical machine learning algorithms. You want your tools to play nicely together, and this trio is battle-tested.

Don’t forget your IDE or text editor. VS Code or PyCharm are solid picks. Configure your linter (like flake8) and formatter (black) early on. This might feel like busywork, but once you’re deep in the code, these will save you hours of headaches.

Here’s a quick snippet to check your environment sanity:

import sys
import numpy
import pandas
import sklearn

print(f"Python version: {sys.version}")
print(f"NumPy version: {numpy.__version__}")
print(f"Pandas version: {pandas.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")

If your versions don’t show up or you get import errors, you’re not ready. Fix that first; there’s no point in moving forward until your foundation is solid. Remember, the goal is to get to coding, not wrestling your setup.

One last thing: keep your data organized. Create a data/ folder, and don’t scatter CSVs all over the place. Your script should clearly point to data paths relative to the project root, not some absolute path on your desktop. This makes your project portable and reproducible.
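A minimal sketch of what project-relative data paths can look like, using pathlib from the standard library. It assumes the script sits in the project root next to data/; the helper data_path and the file name housing.csv are hypothetical, made up for illustration.

```python
from pathlib import Path

# Assumption: this file lives in the project root, next to the data/ folder.
# In an interactive session __file__ is undefined, so fall back to the
# current working directory.
try:
    PROJECT_ROOT = Path(__file__).resolve().parent
except NameError:
    PROJECT_ROOT = Path.cwd()

DATA_DIR = PROJECT_ROOT / "data"


def data_path(filename):
    """Return the full path of a file inside the project's data/ folder."""
    return DATA_DIR / filename


# "housing.csv" is a hypothetical file name used only for illustration.
csv_path = data_path("housing.csv")
print(csv_path)
```

Because every path is built from PROJECT_ROOT, the same script keeps working when the project folder is moved or cloned onto another machine.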

Once your environment, libraries, and folder structure are sorted, you’re ready to start building your first model. But before that, it helps to understand what a model really is under the hood. Spoiler: it’s a stack of lego bricks, not magic.

Your first model is just a stack of lego bricks

At its core, a machine learning model is just a function. It takes input data, does some math, and spits out predictions. The “learning” part is about adjusting the knobs inside that function so it makes better predictions over time. Think of those knobs as parameters: numbers the model tunes to fit your data.

Let’s start with something simple: linear regression. This is the classic example where your model tries to fit a straight line through your data points. The line is defined by two parameters, slope (m) and intercept (b), and your job is to find their best values.

Here’s how you might write a bare-bones linear regression model from scratch. No magic libraries, just Python and NumPy:

import numpy as np

class LinearRegression:
    def __init__(self):
        self.m = 0  # slope
        self.b = 0  # intercept

    def predict(self, X):
        return self.m * X + self.b

    def fit(self, X, y):
        # Closed-form solution (normal equation)
        X_mean = np.mean(X)
        y_mean = np.mean(y)

        numerator = np.sum((X - X_mean) * (y - y_mean))
        denominator = np.sum((X - X_mean) ** 2)

        self.m = numerator / denominator
        self.b = y_mean - self.m * X_mean

This is your first model: two numbers that define a line. The fit method calculates those numbers by minimizing the squared difference between predicted and actual values. This “closed-form” solution is neat because it doesn’t require iteration, but this particular formula only covers the single-feature case.

Try it out with some data:

X = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression()
model.fit(X, y)

print(f"Slope: {model.m}")
print(f"Intercept: {model.b}")

predictions = model.predict(X)
print("Predictions:", predictions)

Notice how the model learned that the slope is 2 and the intercept is 1, which matches the pattern y = 2x + 1. This is exactly what you expected, because the data was generated that way.
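If you want a sanity check on the hand-written formulas, one option is to compare them with NumPy’s own least-squares line fit. This is only a sketch: it recomputes the slope and intercept with the same formulas as fit and compares them with np.polyfit, which for deg=1 returns the slope followed by the intercept.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])

# Closed-form least-squares estimates, same formulas as in fit()
m = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b = y.mean() - m * X.mean()

# np.polyfit with deg=1 fits a straight line and returns [slope, intercept]
slope, intercept = np.polyfit(X, y, deg=1)

# Compare with a tolerance rather than ==, since these are floats
print("Slopes agree:", np.isclose(m, slope))
print("Intercepts agree:", np.isclose(b, intercept))
```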

Models get more complex, but the building blocks stay the same: parameters (the knobs), input features (the data), and a function that maps inputs to outputs. You stack those bricks together, sometimes hundreds or thousands of them, but it all boils down to math.

In real-world scenarios, your input won’t be a single number but a vector of features. Your model might look like this:

class MultivariateLinearRegression:
    def __init__(self, n_features):
        self.weights = np.zeros(n_features)
        self.bias = 0

    def predict(self, X):
        return X.dot(self.weights) + self.bias

    def fit(self, X, y, lr=0.01, epochs=1000):
        n_samples, n_features = X.shape

        for _ in range(epochs):
            y_pred = self.predict(X)
            # Calculate gradients
            dw = (1 / n_samples) * X.T.dot(y_pred - y)
            db = (1 / n_samples) * np.sum(y_pred - y)

            # Update parameters
            self.weights -= lr * dw
            self.bias -= lr * db

Here, the model has a weight for each feature and a bias term. The fit method uses gradient descent, an iterative process that nudges the weights and bias in the right direction to reduce error. This is the same principle behind training neural networks, just scaled up.
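To see the same update rule recover known weights, here’s a small self-contained sketch. The function fit_linear_gd is a hypothetical stand-alone version of the fit loop above, and the data is synthetic: three made-up features, made-up “true” weights, and no noise, so the right answer is known in advance. The learning rate and epoch count are assumptions that happen to suit this data.

```python
import numpy as np


def fit_linear_gd(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent for linear regression; returns (weights, bias)."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(epochs):
        error = X.dot(weights) + bias - y             # prediction minus target
        weights -= lr * (X.T.dot(error) / n_samples)  # same dw as in fit()
        bias -= lr * error.mean()                     # same db as in fit()
    return weights, bias


# Synthetic, noise-free data with known parameters (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X.dot(true_w) + true_b

weights, bias = fit_linear_gd(X, y)
print("Learned weights:", np.round(weights, 3))
print("Learned bias:", round(bias, 3))
```

With noise-free data the learned values should land very close to true_w and true_b. On real data they won’t match anything exactly, because there is no known answer to compare against.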

Before you get overwhelmed: this is still very simple compared to what libraries like TensorFlow or PyTorch do under the hood. But building this yourself gives you intuition about what’s really happening when you call model.fit() or model.predict().

Next up, you’ll need to tell your model how to learn: what it means to be “wrong,” and how to fix that. But first, make sure you understand these basics, because every complex model is just a clever arrangement of these fundamental pieces.

Telling your model how to learn and when it’s wrong

So your model is a function with a bunch of knobs (parameters). “Learning” is just the process of turning those knobs until the function produces outputs that are as close as possible to the real answers. But how does the model know which way to turn the knobs? It needs a scorekeeper. In machine learning, this scorekeeper is called a loss function, or sometimes a cost function or objective function. It’s a simple idea: you give it your model’s predictions and the actual correct answers, and it spits out a single number that tells you how wrong the model was. The lower the number, the better.

For regression problems, like the linear regression we just built, the most common loss function is Mean Squared Error (MSE). It’s exactly what it sounds like: for each data point, you calculate the difference between the predicted value and the true value (the error), you square it, and then you take the average of all those squared errors. Squaring the error does two useful things: it makes all the errors positive (so they don’t cancel each other out) and it penalizes larger errors more heavily.

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
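A tiny worked example makes the “penalizes larger errors” point concrete. The three values below are made up; the errors are 1, 0, and -3, so the squared errors are 1, 0, and 9, and the mean is 10/3.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 10.0])

errors = y_true - y_pred        # 1, 0, -3
squared_errors = errors ** 2    # 1, 0, 9
mse = squared_errors.mean()     # (1 + 0 + 9) / 3

print("Errors:", errors)
print("Squared errors:", squared_errors)
print("MSE:", mse)
```

The single miss of 3 contributes 9 of the 10 total squared-error units, which is why one bad prediction can dominate MSE.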

The entire goal of the training process is to find the set of weights and biases that makes this MSE value as small as possible. This is where the optimizer comes in. The gradient descent algorithm we used in the MultivariateLinearRegression class is an optimizer. It calculates the gradient (a fancy word for the slope) of the loss function with respect to each parameter. The gradient points uphill, so its negative tells you which direction is “downhill” for the loss. You then take a small step in that direction, update the parameters, and repeat. Do this thousands of times, and you’ll eventually slide down into a valley where the loss is at a minimum.

The size of that “small step” is controlled by the learning rate (lr). This is arguably the most important hyperparameter you’ll tune. If your learning rate is too high, you’ll take giant steps and constantly overshoot the minimum, like a drunk person trying to walk down a hill. If it’s too low, your model will learn at a glacial pace, taking tiny, inefficient steps. Finding the right learning rate is more art than science, but it’s crucial for effective training.
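Here’s a toy sketch of that trade-off. It isn’t a real model: it minimizes the one-parameter loss (w - 3)^2, whose minimum is at w = 3, and the function descend and the three learning rates are picked purely for illustration.

```python
def descend(lr, steps=50, w=0.0):
    """Run gradient descent on the toy loss (w - 3)^2 and return the final w."""
    for _ in range(steps):
        grad = 2 * (w - 3.0)  # derivative of (w - 3)^2 with respect to w
        w -= lr * grad
    return w


# Too low, reasonable, and too high for this particular loss
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {descend(lr):.4f}")
```

With the tiny rate, w has barely left its starting point after 50 steps. With 0.1 it ends up very close to 3. With 1.1 every step overshoots by more than the previous one, so w moves further from 3 instead of closer.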

Let’s modify our regression model to actually track and use this loss. We’ll add a loss calculation inside the training loop so we can see the model getting smarter over time.

class MultivariateLinearRegression:
    def __init__(self, n_features):
        self.weights = np.zeros(n_features)
        self.bias = 0

    def predict(self, X):
        return X.dot(self.weights) + self.bias

    def fit(self, X, y, lr=0.01, epochs=1000):
        n_samples, n_features = X.shape

        for i in range(epochs):
            y_pred = self.predict(X)
            
            # Calculate loss
            loss = np.mean((y_pred - y) ** 2)
            if i % 100 == 0:
                print(f"Epoch {i}, Loss: {loss:.4f}")

            # Calculate gradients
            dw = (1 / n_samples) * X.T.dot(y_pred - y)
            db = (1 / n_samples) * np.sum(y_pred - y)

            # Update parameters
            self.weights -= lr * dw
            self.bias -= lr * db

When you run this with a reasonable learning rate, you’ll see the printed loss shrink every 100 epochs. That’s it. That’s learning. It’s not magic; it’s just minimizing a number by repeatedly taking small steps in the right direction.

But what if you’re not predicting a number? What if you’re doing classification, like trying to predict whether an email is spam or not spam? MSE doesn’t make sense here. Your output is a probability (e.g., 80% chance of being spam), not a continuous value. For this, you need a different loss function. The standard choice for binary classification is Binary Cross-Entropy. It’s designed to measure the distance between two probability distributions: in this case, the true distribution (spam is 1, not spam is 0) and your model’s predicted distribution.

The math looks a bit intimidating at first, but the concept is the same. It returns a high value if the model is confidently wrong (e.g., predicts 0.1 for an actual 1) and a low value if the model is confidently right (e.g., predicts 0.9 for an actual 1). The optimizer’s job remains unchanged: minimize this number.

def binary_cross_entropy(y_true, y_pred):
    # Add a small epsilon to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
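To put numbers on “confidently wrong” versus “confidently right,” this sketch repeats the function so it runs on its own and evaluates it on a single example whose true label is 1.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred):
    # Clip predictions away from 0 and 1 to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))


y_true = np.array([1.0])

# For a true label of 1 the loss reduces to -log(predicted probability)
loss_right = binary_cross_entropy(y_true, np.array([0.9]))
loss_wrong = binary_cross_entropy(y_true, np.array([0.1]))

print(f"Predicted 0.9 for an actual 1: {loss_right:.3f}")
print(f"Predicted 0.1 for an actual 1: {loss_wrong:.3f}")
```

The confident mistake costs roughly twenty times more than the confident hit (about 2.303 versus 0.105), and the penalty keeps growing as the wrong prediction approaches 0.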

Choosing the right loss function is critical. It’s how you define the problem you’re trying to solve for the model. Use MSE for regression, and use a form of cross-entropy for classification. If you mismatch them, your model will be optimizing for the wrong goal, and you’ll get garbage results, no matter how sophisticated your architecture is.

So is your new brain actually smart

You’ve picked a loss function and an optimizer, and you’ve watched your training loss plummet. It’s a great feeling. You see the numbers going down epoch after epoch, and you think, “Wow, my model is getting really smart!” Hold on. Don’t pop the champagne just yet. A low training loss doesn’t mean your model is smart; it just means it got very good at answering the specific questions you put on its practice test. It might have just memorized the answers. This is called overfitting, and it’s the cardinal sin of machine learning. An overfitted model is useless in the real world because it hasn’t learned the underlying patterns; it’s just a glorified lookup table for the data you showed it.

So how do you know if your model has actually learned something generalizable? You test it on questions it has never seen before. This is the single most important concept in applied machine learning: you must evaluate your model on a hold-out test set. Before you even start training, you split your dataset into two pieces: a training set (usually 70-80% of the data) and a test set (the remaining 20-30%). You train your model only on the training set. The test set stays locked away in a vault, untouched and unseen by the model during the entire training process.

Scikit-learn makes this dead simple. You don’t need to write this code yourself, and you shouldn’t, because it’s easy to get wrong.

from sklearn.model_selection import train_test_split

# Assuming X is your feature matrix and y is your target vector
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Now you train your model ONLY on X_train and y_train
model.fit(X_train, y_train)

# And you evaluate it ONLY on X_test and y_test
predictions = model.predict(X_test)

The random_state parameter is important. It ensures that the split is reproducible. If you run the code again, you’ll get the exact same split, which is crucial for debugging and comparing different models. The performance on the test set is your model’s true report card. That’s the number you show your boss. The training loss is just for you, the engineer, to know if the optimizer is even working.
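Here’s a rough sketch of why the hold-out set matters, under made-up assumptions: noisy synthetic data that follows a straight line, and two polynomial fits via np.polyfit standing in for a “simple” and a “flexible” model. The degrees, noise level, and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: a straight line plus noise (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 2 * x + 1 + rng.normal(scale=0.3, size=x.shape)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Same random_state, same split: handy when comparing models
x_train2, x_test2, _, _ = train_test_split(x, y, test_size=0.2, random_state=42)
print("Split is reproducible:", np.array_equal(x_test, x_test2))


def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


results = {}
for degree in (1, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = mse(y_train, np.polyval(coeffs, x_train))
    test_mse = mse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

The flexible fit will score at least as well as the straight line on the training data, because it can bend toward the noise. Whether that carries over to the test set is the question the hold-out numbers answer; a training error far below the test error is the usual sign of overfitting.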

Now, what number should you actually be looking at? While the loss function is what the optimizer uses to navigate, it’s not always the most human-interpretable metric. For regression, Mean Squared Error is great for optimization, but telling someone the MSE is 14.7 doesn’t mean much. Is that good? Bad? Who knows. A more intuitive metric is Mean Absolute Error (MAE), which is just the average absolute difference between the predictions and the true values. It’s in the same units as your target variable, so you can say, “On average, our model’s price prediction is off by $500.” That’s a statement a non-technical person can understand.

from sklearn.metrics import mean_squared_error, mean_absolute_error

# After getting predictions on the test set
test_mse = mean_squared_error(y_test, predictions)
test_mae = mean_absolute_error(y_test, predictions)

print(f"Test MSE: {test_mse:.4f}")
print(f"Test MAE: {test_mae:.4f}")

For classification, the most straightforward metric is accuracy: what percentage of predictions did the model get right? It’s easy to calculate and easy to understand.

from sklearn.metrics import accuracy_score

# Assuming a classification model
# predictions are class labels (e.g., 0 or 1)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")

But accuracy can be dangerously misleading, especially if your classes are imbalanced. Imagine you’re building a model to detect a rare disease that affects only 1% of the population. A lazy model that simply predicts “no disease” for everyone would have 99% accuracy. It sounds impressive, but it’s completely useless because it never finds the cases you actually care about. This is where you need to be a smarter evaluator. You need to look at a confusion matrix, which breaks down the predictions into True Positives, True Negatives, False Positives, and False Negatives. It tells you not just *how many* predictions were wrong, but *how* they were wrong.
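The lazy-model scenario is easy to reproduce. This sketch builds made-up labels where 10 out of 1,000 people have the disease and scores a “model” that predicts 0 for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 1% positive (disease), 99% negative
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# The lazy model: predict "no disease" for everyone
lazy_predictions = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, lazy_predictions)
recall = recall_score(y_true, lazy_predictions)
cm = confusion_matrix(y_true, lazy_predictions)

print(f"Accuracy: {accuracy:.2%}")
print(f"Recall: {recall:.2%}")
print(cm)  # rows are true classes, columns are predicted classes
```

Accuracy comes out at 99% while recall is 0%: all ten real cases sit in the false-negative cell of the confusion matrix.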

from sklearn.metrics import confusion_matrix, classification_report

# Generate a confusion matrix
cm = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(cm)

# Get a full report with precision, recall, and f1-score
report = classification_report(y_test, predictions)
print("\nClassification Report:")
print(report)

This report gives you precision (of all the times the model predicted “disease,” how often was it right?) and recall (of all the actual disease cases, how many did the model find?). These two numbers are often in tension. You can build a very precise model that is only confident about the most obvious cases, but it will have low recall because it misses many others. Or you can build a high-recall model that finds almost all cases but also generates a lot of false alarms (low precision). The right balance depends entirely on your business problem. Is it worse to miss a case of the disease, or to send a healthy person for more tests? That’s not a machine learning question; it’s a product question.
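One common way this tension shows up is through the decision threshold: the cutoff that turns a predicted probability into a 0 or 1. The sketch below uses ten made-up labels and scores (not output from any real model) and sweeps three thresholds.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities for illustration
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.95])

results = {}
for threshold in (0.25, 0.5, 0.9):
    predictions = (scores >= threshold).astype(int)
    precision = precision_score(y_true, predictions)
    recall = recall_score(y_true, predictions)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, the low threshold finds every positive but also raises three false alarms, while the high threshold makes one prediction, gets it right, and misses the other three positives.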
