How I Built My First AI Project in Python Using Google Colab

Ibrahim Oluwaseun

A passionate programmer.

Introduction

I always believed that creating an AI project was like trying to climb Mount Everest in flip-flops. Then I discovered Google Colab, and everything changed. No fancy hardware, no complicated setup: all you need is a browser and a dream.

Follow this guide and I will walk you through how I created my first AI project in Python using Google Colab. Spoiler: it's easier than you'd expect, and by the end you will have built your first AI project too.

Below are the areas we will explore:

Why Google Colab for beginners?
Setting up your Google Colab environment.
Choosing an AI project for beginners (simple but worthwhile).
A step-by-step tutorial to build a spam email classifier.
Tips for beginners to keep learning and growing.

Ready? Let’s roll.

Why I Used Google Colab for My First AI Project:

Imagine you are all jazzed up to kick-start your first AI project, only to watch your laptop struggle with a single Chrome tab. Sound familiar? That was me. Then I came across Google Colab, and it was like switching from a tricycle to a Tesla.

Why Colab is a Game Changer:

No setup required: Open your browser and begin writing code. Nothing to install, nothing to configure.
Free access to GPUs: Models train slowly on an ordinary laptop, and Google Colab offers free GPU access for that.
Collaboration is a breeze: You can share your Colab notebook with a single link. Send it and go!
Cloud storage: Your notebook is saved automatically to Google Drive, so you won't lose your code. (Files you upload to a session are temporary, though.)

If I can do it, so can you. Now let’s get started.

Getting Started with Google Colab

First, head to Google Colab. On the site, click “New Notebook,” and you’re in. It is like walking into a fully stocked kitchen: everything you need is right there.

[Image: Google Colab home page]

Here’s how to set it up:

Create a New Notebook:
- Click “New Notebook” to create a new project from scratch.
- Give your notebook a name that says what its purpose is, such as “Spam Email Classifier.”
Familiarize yourself with the interface:
- Code cells: Where you type and run your Python code.
- Text cells: For notes, explanations, or headings.
- Runtime menu: For managing your session (for example, restart or disconnect).

Tip: If your session disconnects, don’t panic. Just reconnect and rerun all the cells. It's kind of like pressing R to restart in a video game, only more annoying.

Best AI project to start with

Starting simple is everything. My first project? An email spam classifier. Why? Because nobody likes spam, and it is a great way to practice some AI.

Here are a few hands-on ideas for beginners:

Email spam classifier: Teach a system to label mail as spam or not spam.
Number recognition: Train a system to classify the handwritten digits in the MNIST dataset.
Sentiment analysis: Figure out whether a tweet is happy, sad, or just plain angry.

I went with the spam classifier because it solves a real-life problem and the results are easy to see. And who does not love a clean inbox?

Step by Step Google Colab Tutorial to Create a Spam Email Classifier

Let's build this together. Follow along with me, and in no time you will have a fully working AI model.

Step 1: Import Libraries

Before we start cooking up our AI project, we need to gather the right ingredients—Python libraries. Think of these as pre-built tools that save us time and effort. Here’s what we’ll use and why:

1. Pandas (`pandas`):

What it does: Pandas is like a supercharged Excel for Python. It helps us load, explore, and manipulate data in tables (called DataFrames).
Why we need it: Our dataset (spam emails) is stored in a CSV file, and Pandas makes it easy to load and work with this data.

2. Scikit-learn (`sklearn`):

Scikit-learn is a powerful library for machine learning. We’ll use several tools from it:

train_test_split: Splits our dataset into two parts—one for training the model and one for testing it.
CountVectorizer: Converts text (like email content) into numbers so the computer can understand it.
MultinomialNB: This is the Naive Bayes algorithm, which we’ll use to train our spam classifier.
accuracy_score: Helps us measure how well our model performs.

3. How to Import the Libraries:

Here’s the code to import these tools. Copy and paste this into a code cell in your Google Colab notebook:

# Importing the necessary libraries
import pandas as pd  # For handling data
from sklearn.model_selection import train_test_split  # For splitting the dataset
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numbers
from sklearn.naive_bayes import MultinomialNB  # For training the model
from sklearn.metrics import accuracy_score  # For evaluating the model

What to Do Next:

Open your Google Colab notebook.
Create a new code cell by clicking the + Code button.
Copy and paste the code above into the cell.
Press Shift + Enter to run the code.

If everything works, you won’t see any errors, just a blank output. This means the libraries were imported successfully and are ready to use.


These libraries are the backbone of our project. Without them, we would have to write hundreds of lines of code from scratch. With their help, we can jump straight to the exciting part: building our AI model.

Step 2: Load and Explore the Dataset

Now that we’ve imported our tools, it’s time to get our hands on the data. Think of this step as gathering all the ingredients before you start cooking. For this project, we’ll use a spam email dataset from Kaggle. Here’s how to load and explore it in Google Colab.

1. Download the Dataset

First, we need to get the dataset. Here’s how:

Go to the Kaggle SMS Spam Collection Dataset.
Click the Download button to get the spam.csv file.
Save the file to your computer (usually in the Downloads folder).

2. Upload the Dataset to Google Colab

Since Google Colab runs in the cloud, we need to upload the dataset to our notebook. Here’s how:

In your Colab notebook, run the following code in a new code cell:
```
from google.colab import files

uploaded = files.upload()
```
After running the code, you’ll see a “Choose File” button. Click it and select the spam.csv file you downloaded.
Once the file is uploaded, you’ll see a confirmation message like this:
Saving spam.csv to spam.csv

3. Load the Dataset into Pandas

Now that the dataset is uploaded, let’s load it into a Pandas DataFrame so we can work with it.

Run the following code in a new code cell:
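
A minimal sketch of this cell, reconstructed from the three calls explained below, might look like the following. So that it runs anywhere, it first writes a tiny stand-in file (`sample_spam.csv`, a hypothetical name with made-up rows). In your notebook, skip that part and set `CSV_PATH` (also a name chosen for this sketch) to the `spam.csv` you uploaded.

```python
import pandas as pd

# Stand-in file with made-up rows so the sketch runs without the real data.
# In Colab, skip this part and set CSV_PATH = "spam.csv" instead.
CSV_PATH = "sample_spam.csv"  # hypothetical file name
with open(CSV_PATH, "w", encoding="latin-1") as f:
    f.write("v1,v2\n")
    f.write("ham,See you at lunch tomorrow\n")
    f.write("spam,WIN a FREE prize now\n")
    f.write("ham,Can you call me later?\n")

# Load the CSV into a DataFrame; latin-1 avoids decode errors on odd characters
data = pd.read_csv(CSV_PATH, encoding="latin-1")

# Peek at the first 5 rows
print(data.head())
```

In a notebook you can end the cell with a bare `data.head()` instead of `print(...)` to get the nicer table view.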

Here’s what’s happening in the code:
- pd.read_csv('spam.csv'): Loads the dataset into a Pandas DataFrame.
- encoding='latin-1': Ensures the file is read correctly, as some characters in the dataset might cause errors.
- data.head(): Displays the first 5 rows of the dataset so we can take a peek.

After running the code, you’ll see something like this:

	v1	v2
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...


4. Understand the Dataset

Let’s break down what we’re looking at:

v1 Column: This is the label. It tells us whether the message is ham (not spam) or spam.
v2 Column: This is the text of the email or SMS message.

Our goal is to train a model that can predict whether a message is ham or spam based on its text.

5. Clean Up the Dataset

The dataset has some extra columns we don’t need. Let’s clean it up:

Run the following code in a new code cell:
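
One way to do this cleanup, sketched here on a made-up DataFrame so it runs on its own, is to keep only `v1` and `v2` and give them the readable names shown in the output below. The extra column name `Unnamed: 2` is an assumption about how the unwanted columns appear. In your notebook, skip the stand-in DataFrame and run only the last two lines on your real `data`.

```python
import pandas as pd

# Stand-in data (made-up rows) with one unwanted extra column.
# "Unnamed: 2" is an assumed name for the extra column.
data = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at lunch tomorrow", "WIN a FREE prize now", "Can you call me later?"],
    "Unnamed: 2": [None, None, None],
})

# Keep only the label and message columns, then rename them
data = data[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(data.head())
```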

After running the code, the dataset will look cleaner:

	label	text
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...

6. Convert Labels to Numbers

Machine learning models work better with numbers than text. Let’s convert the label column:

ham → 0
spam → 1

Run the following code in a new code cell:
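
A small sketch of the conversion, using pandas' `map` with a dictionary for the ham → 0, spam → 1 mapping above. The DataFrame here is made up so the example runs by itself; in your notebook, apply only the `map` line to your real `data`.

```python
import pandas as pd

# Stand-in data (made-up rows) in the cleaned label/text layout
data = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "text": ["See you at lunch tomorrow", "WIN a FREE prize now", "Can you call me later?"],
})

# Replace the text labels with numbers: ham -> 0, spam -> 1
data["label"] = data["label"].map({"ham": 0, "spam": 1})
print(data.head())
```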

After running the code, the label column will now have 0 for ham and 1 for spam:

	label	text
0	0	Go until jurong point, crazy.. Available only ...
1	0	Ok lar... Joking wif u oni...
2	1	Free entry in 2 a wkly comp to win FA Cup fina...
3	0	U dun say so early hor... U c already then say...
4	0	Nah I don't think he goes to usf, he lives aro...

What to Do Next:

Now that the dataset is clean and ready, we’ll move on to preprocessing the data in the next step.

Why This Matters:

Loading and exploring the dataset is a crucial step. It helps us understand the data we’re working with and ensures it’s in the right format for training our model.

Step 3: Preprocess the Data

We have loaded and cleaned our dataset; now it's time to prepare it for the AI model. This step is like chopping vegetables before cooking: you need all the ingredients in the right shape for the recipe to work.

Preprocessing is important in machine learning because raw data can be full of irregularities or in the wrong format for the model. Here is how we will preprocess the spam email dataset:

1. Split the Dataset into Training and Testing Sets

Before we train our model, we need to split the dataset into two parts:

Training set: Used to teach the model.
Testing set: Used to evaluate how well the model performs.

This ensures that the model isn’t just memorizing the data but actually learning to generalize.

Run the following code in a new code cell:

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Here’s what’s happening in the code:
- data['text']: The input features (email text).
- data['label']: The target labels (0 for ham, 1 for spam).
- test_size=0.2: 20% of the data will be used for testing, and 80% for training.
- random_state=42: Ensures the split is reproducible (you’ll get the same result every time).
After running the code, you’ll see something like this:
```
Training set size: (4457,)  
Testing set size: (1115,)  
```
This means we have 4,457 emails for training and 1,115 for testing.

2. Convert Text into Numbers

Machine learning models don’t understand text—they work with numbers. So, we need to convert the email text into a numerical format. We’ll use a technique called CountVectorizer, which counts how often each word appears in the text.

Run the following code in a new code cell:

from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_vec = vectorizer.transform(X_test)

# Display the shape of the transformed data
print("Training data shape after vectorization:", X_train_vec.shape)
print("Testing data shape after vectorization:", X_test_vec.shape)

Here’s what’s happening in the code:
- CountVectorizer(): Converts text into a matrix of word counts.
- fit_transform(X_train): Learns the vocabulary from the training data and transforms it into numerical features.
- transform(X_test): Applies the same transformation to the testing data (without relearning the vocabulary).
After running the code, you’ll see something like this:
```
Training data shape after vectorization: (4457, 8672)  
Testing data shape after vectorization: (1115, 8672)  
```
This means the training data has 4,457 emails and 8,672 unique words, while the testing data has 1,115 emails and the same 8,672 words.

3. Understand the Vectorized Data

Let’s break down what the vectorized data looks like:

Each row represents an email.
Each column represents a word in the vocabulary.
The values in the matrix represent how many times each word appears in the email.

For example:

If the word “free” appears 3 times in an email, the corresponding value in the matrix will be 3.
If the word “prize” doesn’t appear, the value will be 0.

This numerical representation allows the model to process the text data effectively.
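
To make the “free” and “prize” example concrete, here is a small sketch on two made-up messages. The names `demo_vectorizer`, `demo_messages`, and `demo_counts` are chosen for this illustration so they don't overwrite the tutorial's own `vectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages: the first repeats "free" three times
demo_messages = ["free free free entry", "win a prize today"]

demo_vectorizer = CountVectorizer()
demo_counts = demo_vectorizer.fit_transform(demo_messages)

# vocabulary_ maps each learned word to its column index
free_col = demo_vectorizer.vocabulary_["free"]
prize_col = demo_vectorizer.vocabulary_["prize"]

print(demo_counts[0, free_col])   # 3: "free" appears three times in message 0
print(demo_counts[0, prize_col])  # 0: "prize" does not appear in message 0
print(demo_counts.shape)          # (2, 5): 2 messages, 5 distinct words
```

Note that the one-letter word “a” is not counted: by default, CountVectorizer only keeps tokens of two or more characters.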

What to Do Next:

Now that the data is preprocessed, we’re ready to train the model in the next step.

Step 4: Training the Model

With the data preprocessed, it is time to train the model. Think of this as the moment where we teach a student how to solve a problem and then expect them to solve similar ones on their own. We will use the Naive Bayes algorithm; it is easy to understand and great for text classification tasks such as spam detection.

1. What is Naive Bayes?

Naive Bayes is a probabilistic algorithm based on Bayes’ Theorem. It is called “naive” because it makes a strong assumption: each feature (each word, in this case) contributes to the probability independently of the others. This does not always hold true, but Naive Bayes still works pretty well for text classification problems such as spam detection.

2. Initialize the Model

First, we have to create an instance of the Naive Bayes model. We will go with the Multinomial Naive Bayes version, which is suitable for discrete data such as word counts.

Run the following code in a new code cell:
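
A minimal sketch of this cell, based on the `MultinomialNB()` call described below:

```python
from sklearn.naive_bayes import MultinomialNB

# Create an untrained Multinomial Naive Bayes classifier with default settings
model = MultinomialNB()
print(model)
```

The default settings are fine for a first attempt. The model knows nothing yet; it only learns once `fit` is called in the next step.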

Here’s what’s happening in the code:
- MultinomialNB(): Creates a Naive Bayes model suitable for text data.

3. Train the Model

Next, we’ll train the model using the training data. This is where the model learns the relationship between the email text (X_train_vec) and the labels (y_train).

Run the following code in a new code cell:
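
A sketch of the training cell, following the two calls explained below. To keep it runnable on its own, it builds a tiny made-up training set first (`toy_texts` and `toy_vectorizer` are names invented for this example). In your notebook, `X_train_vec` and `y_train` already exist from Step 3, so you only need the last three lines.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in training data (made up); in the notebook these come from Step 3
toy_texts = ["win a free prize now", "see you at lunch", "free cash offer", "call me later"]
toy_vectorizer = CountVectorizer()
X_train_vec = toy_vectorizer.fit_transform(toy_texts)
y_train = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Train the model on the vectorized text and its labels
model = MultinomialNB()
model.fit(X_train_vec, y_train)
print("Model training complete!")
```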

Here’s what’s happening in the code:
- model.fit(X_train_vec, y_train): Teaches the model to predict the labels (y_train) based on the features (X_train_vec).
- print("Model training complete!"): Lets you know the training process is done.
After running the code, you’ll see:
```
Model training complete!
```

4. What Happens During Training?

During training, the model:

Learns probabilities: It figures out how likely each word is to appear in a spam email versus a ham email.
- For example, words like “free” or “prize” might have a higher probability of appearing in spam emails.
Builds a decision rule: Based on these probabilities, the model creates a rule to classify new emails as spam or ham.
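
You can peek at these learned probabilities yourself. The sketch below trains on four made-up messages (all `toy_` names are invented for this example) and reads scikit-learn's `feature_log_prob_` attribute, which stores the log-probability of each word for each class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up messages: "free" shows up only in the spam examples
toy_texts = ["free prize now", "win free cash", "see you at lunch", "call me later"]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Rows of feature_log_prob_ follow toy_model.classes_, which is [0, 1] here
free_col = toy_vectorizer.vocabulary_["free"]
log_p_free_ham = toy_model.feature_log_prob_[0, free_col]
log_p_free_spam = toy_model.feature_log_prob_[1, free_col]
print("log P(free | ham): ", log_p_free_ham)
print("log P(free | spam):", log_p_free_spam)

# The decision rule in action on a new message
toy_prediction = toy_model.predict(toy_vectorizer.transform(["free prize"]))
print("Prediction for 'free prize':", toy_prediction[0])
```

The value for spam is the larger of the two, which is why a message containing “free” gets pushed toward the spam class.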

5. Why Naive Bayes Works Well for Text Data

Efficiency: Naive Bayes is very fast and it works well even with large datasets.
Simplicity: It’s very easy to implement and also to interpret.
Effectiveness: Despite how simple it is, it often performs well on text classification tasks like spam detection.

What to Do Next:

Now that the model is trained, we’ll move to testing its performance in the next step.

Why This Matters:

Training the model is the main part of the machine learning process. It’s where the model learns to make predictions based on the data. Without this training, the model would just be a blank slate with no ability to classify emails.

Step 5: Test the Model

Now that our model is trained, it’s time to see how well it performs. Think of this step as giving the student a test to see if they’ve really learned the material. We’ll use the testing set (the 20% of data we set aside earlier) to evaluate the model’s accuracy.

1. Why Test the Model?

Testing is crucial because it tells us whether the model can generalize to new, unseen data. If the model performs well on the testing set, it’s a good sign that it will work in real-world scenarios.

2. Make Predictions on the Testing Set

First, we’ll use the trained model to predict the labels for the testing data.

Run the following code in a new code cell:

# Use the model to make predictions
predictions = model.predict(X_test_vec)

# Display the first 10 predictions
print("First 10 predictions:", predictions[:10])
print("First 10 actual labels:", y_test[:10].to_numpy())  # to_numpy() prints a plain array instead of an indexed Series

Here’s what’s happening in the code:
- model.predict(X_test_vec): Uses the trained model to predict whether each email in the testing set is spam (1) or ham (0).
- predictions[:10]: Shows the first 10 predictions.
- y_test[:10]: Shows the first 10 actual labels for comparison.
After running the code, you’ll see something like this:
```
First 10 predictions: [0 0 0 0 0 1 0 0 0 0]  
First 10 actual labels: [0 0 0 0 0 1 0 0 0 0]  
```
This means the model correctly predicted the first 10 emails in the testing set.

3. Evaluate the Model’s Accuracy

Next, we’ll calculate the model’s accuracy—the percentage of correct predictions.

Run the following code in a new code cell:

from sklearn.metrics import accuracy_score

# Calculate the accuracy
accuracy = accuracy_score(y_test, predictions)

# Display the accuracy as a percentage
print(f"Model accuracy: {accuracy * 100:.2f} %")  # rounded to two decimal places

Here’s what’s happening in the code:
- accuracy_score(y_test, predictions): Compares the predicted labels (predictions) with the actual labels (y_test) and calculates the accuracy.
- accuracy * 100: Converts the accuracy to a percentage for easier interpretation.
After running the code, you’ll see something like this:
```
Model accuracy: 98.39 %  
```
This means the model correctly classified 98.39% of the emails in the testing set.


4. Why Accuracy Matters

Accuracy is a key metric for evaluating classification models. Here’s how to interpret it:

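High accuracy (e.g., >95%): The model is performing well on the test set. Because spam is much rarer than ham, it is still worth checking the confusion matrix before trusting it for real-world use.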
Low accuracy (e.g., <80%): The model may need improvement (e.g., more data, better features, or a different algorithm).

In our case, an accuracy of 98.39% is excellent for a first attempt!

5. Confusion Matrix (Optional)

For a deeper understanding of the model’s performance, we can create a confusion matrix. This shows how many predictions were correct and how many were incorrect for each class (spam and ham).

Run the following code in a new code cell:

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Create the confusion matrix
cm = confusion_matrix(y_test, predictions)

# Display the confusion matrix as a heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Here’s what’s happening in the code:
- confusion_matrix(y_test, predictions): Creates a matrix showing the number of true positives, true negatives, false positives, and false negatives.
- sns.heatmap(): Visualizes the confusion matrix as a heatmap for easier interpretation.
After running the code, you’ll see a heatmap with counts laid out like this (the numbers here are illustrative; yours will differ):
```
Actual vs Predicted:
            Ham   Spam  
Ham        1080     5  
Spam         15    15  
```
- True Positives (15): Spam emails correctly classified as spam.
- True Negatives (1080): Ham emails correctly classified as ham.
- False Positives (5): Ham emails incorrectly classified as spam.
- False Negatives (15): Spam emails incorrectly classified as ham.
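
As a rough sketch of how these four counts turn into summary numbers, the snippet below recomputes accuracy from the example matrix and adds precision and recall. Those two are standard metrics that this tutorial does not otherwise cover, so treat them as an optional extra. The example counts work out to about 98.2% accuracy, slightly different from the figure printed earlier, since both outputs are only illustrations.

```python
# Counts taken from the example confusion matrix above
tp = 15    # spam correctly flagged as spam
tn = 1080  # ham correctly left alone
fp = 5     # ham wrongly flagged as spam
fn = 15    # spam that slipped through as ham

total = tp + tn + fp + fn

accuracy = (tp + tn) / total  # share of all messages classified correctly
precision = tp / (tp + fp)    # of the messages flagged as spam, how many really were
recall = tp / (tp + fn)       # of the real spam messages, how many were caught

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
```

With these example counts, accuracy looks excellent while recall is only 50%, which shows why it helps to look beyond a single number.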

What to Do Next:

Now that we’ve tested the model and evaluated its performance, we’ll move on to saving the model in the next step so you can use it later.

Why This Matters:

Testing the model is essential to ensure it works well on new data. Without testing, we wouldn’t know if the model is reliable or just memorizing the training data.

Step 6: Save and Use the Model

Now that we’ve built and tested our spam email classifier, it’s time to save it so we can use it later. Think of this step as packaging up a delicious meal you’ve cooked so you can enjoy it later. Saving the model allows us to reuse it without having to retrain it every time.

1. Why Save the Model?

Saving the model is important because:

Reusability: You can use the model to classify new emails without retraining it.
Portability: You can share the model with others or deploy it in an application.
Efficiency: Retraining the model from scratch every time would be time-consuming and unnecessary.

2. Save the Model Using Joblib

We’ll use the Joblib library to save the model. Joblib is a lightweight library that’s great for saving Python objects, especially machine learning models.

Run the following code in a new code cell:
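
A sketch of the saving cell, matching the `joblib.dump` call explained below. The first part trains a tiny stand-in model on made-up text so the example runs by itself; in your notebook, `model` already exists from Step 4, so skip straight to the `joblib.dump` line.

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in model (made-up data); skip this part in the notebook
toy_vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(toy_vectorizer.fit_transform(["win a free prize", "see you at lunch"]), [1, 0])

# Save the trained model to a file
joblib.dump(model, 'spam_classifier.pkl')
print("Model saved as 'spam_classifier.pkl'")
```

If you plan to classify new text in a fresh session, consider saving the fitted `vectorizer` the same way, since the model needs the same vocabulary to make sense of new messages.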

Here’s what’s happening in the code:
- joblib.dump(model, 'spam_classifier.pkl'): Saves the trained model to a file named spam_classifier.pkl.
- print("Model saved as 'spam_classifier.pkl'"): Confirms that the model has been saved.
After running the code, you’ll see:
```
Model saved as 'spam_classifier.pkl'
```

3. Download the Model File

Since Google Colab runs in the cloud, the saved model file is stored in the Colab environment. To use it on your local machine, you’ll need to download it.

Run the following code in a new code cell:
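
At its core this cell is the `files.download('spam_classifier.pkl')` call described below. The sketch wraps it in a small helper (`download_from_colab`, a name invented here) so that it fails gracefully when the file is missing or when the code runs outside Colab.

```python
import os

MODEL_PATH = "spam_classifier.pkl"  # file name used in the saving step


def download_from_colab(path):
    """Ask the browser to download `path`; return True if the request was made."""
    if not os.path.exists(path):
        print(f"{path} not found; save the model first.")
        return False
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        print("Not running in Colab; the file is already on this machine.")
        return False
    files.download(path)
    return True


requested = download_from_colab(MODEL_PATH)
```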

Here’s what’s happening in the code:
- files.download('spam_classifier.pkl'): Downloads the spam_classifier.pkl file to your computer.
After running the code, your browser will prompt you to save the file. Choose a location on your computer (e.g., Downloads folder).

4. Load and Use the Model

Once the model is saved, you can load it later to classify new emails. Here’s how:

Load the Model:
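
A sketch of the loading step with `joblib.load`, producing the `loaded_model` variable used in the classification code below. If no saved file is found, it creates a tiny stand-in model from made-up text so the example still runs; with your real `spam_classifier.pkl` in place, that branch is skipped.

```python
import os
import joblib

MODEL_PATH = "spam_classifier.pkl"  # file saved in the earlier step

if not os.path.exists(MODEL_PATH):
    # Stand-in only: build and save a toy model so the sketch runs by itself
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    toy_vectorizer = CountVectorizer()
    toy_model = MultinomialNB()
    toy_model.fit(toy_vectorizer.fit_transform(["win a free prize", "see you at lunch"]), [1, 0])
    joblib.dump(toy_model, MODEL_PATH)

# Load the saved model back into memory
loaded_model = joblib.load(MODEL_PATH)
print("Model loaded:", type(loaded_model).__name__)
```

The classification code relies on the same fitted `vectorizer` from Step 3, so it needs to be in memory (or saved and loaded like the model) before you transform new text.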

Classify a New Email:
Let’s say you have a new email: "Congratulations! You've won a $1000 Walmart gift card. Click here to claim your prize."

Here’s how to classify it:

# New email text
new_email = ["Congratulations! You've won a $1000 Walmart gift card. Click here to claim your prize."]

# Convert the text into numerical features
new_email_vec = vectorizer.transform(new_email)

# Make a prediction
prediction = loaded_model.predict(new_email_vec)

# Display the result
if prediction[0] == 1:
    print("This email is spam.")
else:
    print("This email is ham.")

After running the code, you’ll see:
```
This email is spam.
```

5. Why This Matters

Saving and reusing the model makes it practical for real-world applications. For example, you could:

Integrate the model into an email client to automatically filter spam.
Build a web app where users can input emails and see if they’re spam.
Share the model with others so they can use it without having to train it themselves.

What to Do Next:

Now that you’ve saved the model, you can use it anytime to classify new emails. You can also explore other projects, like building a sentiment analysis tool or a number recognition system.

Why This Matters:

Saving the model is the final step in the machine learning pipeline. It ensures that all your hard work isn’t lost and can be reused in the future.

Beginner’s Tips

You might find it hard at first, so here are some tips for you:

Start small: Don't try building Skynet on the first day.
Use online resources: Kaggle and GitHub have plenty of great material to tap into.
Join communities: Reddit’s r/learnmachinelearning is a great place to ask questions.
Practice regularly: The more you build, the better a developer you become.

Remember, even the best AI experts started where you are now.

Showcasing Your Project

After you build your project, show it off! Post it on GitHub or LinkedIn, or boast about it at your next family BBQ.

Sharing your work not only feels great, it can help others too. It is also one of the best ways to expand your portfolio.

Conclusion

Creating your first AI project is a lot like learning to ride a bike. It may feel unsteady at first, but pretty soon there will be no stopping you.

So what are you waiting for? Open Google Colab, choose a project, and start building. When you are done, share it with me in the comments. I would LOVE to see what you come up with!
