SBERT sentence transformer for semantic search within your Gmail inbox
Asking Gmail questions
I was curious what would happen if I could ask my Gmail account these sorts of questions:
- How can I learn machine learning?
- The latest news on computer vision?
Idea
I use an email address hosted by Gmail as my spam address. I've signed up for various services with it over the years and ended up with a bunch of unwanted emails from Quora, Uber, Banana Republic, ...
Despite the "unwantedness" maybe those emails do indicate my interests. Wouldn't it be fun to ask those emails human like questions and see what kind of answers I get within my own over-subscribed echo-chamber?
TL;DR
After extracting 10,000 of my latest emails from Gmail and running SBERT-based semantic search over that email corpus:
- Top 1 answer for question 1 (How can I learn machine learning?) was an email with the subject "Recommended: Linear Algebra for Machine Learning and Data Science".
- Top 1 answer for question 2 (The latest news on computer vision?) was an email with the subject "GTC 2023 Day 4: Top Finale Moments 🏁".
The full hits, with Gmail links, are shown in the Results section below.
Code
Prerequisite: Connect to Gmail API
In order to access the Gmail API you need to create a project (or use an existing one) in the Google Cloud console: https://cloud.google.com/
After the project is selected, go to APIs & Services and click Enabled APIs & Services.
Click + ENABLE APIS AND SERVICES at the top of the screen.
Search for gmail api, select it when found, then enable it.
Next you need to fill out the OAuth Consent Screen, which you can find by navigating back to the main menu under APIs and Services.
Fill out the consent screen by entering the following:
Page 1:
- App name (e.g. Gmail Semantic Search)
- User support email
- Developer contact information
Page 2:
Click Add or remove scopes and add this scope, since we only want to read our emails: https://www.googleapis.com/auth/gmail.readonly
Then click Save and continue.
Page 3:
Add test users. Add the Gmail account(s) you wish to test semantic search on.
When this is done, navigate to the main menu and select Credentials under APIs and Services.
Click Create credentials and choose OAuth Client ID.
- Choose Desktop Application as the application type
- Enter the application name and click the Create button
- The Client ID will be generated. Download it to your computer and save it as credentials.json
Keep your Client ID and Client Secret confidential.
Python Code Quick Walk-through
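Before the walk-through itself, a note on dependencies: I'm assuming a Google Colab environment here, where some of these packages may already be installed, but installing them explicitly shouldn't hurt:
!pip install google-api-python-client google-auth-oauthlib google-auth-httplib2 sentence-transformers beautifulsoup4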
Allowing access to our Gmail messages. Here we need the previously created credentials.json in order to authenticate our Gmail account and allow reading our messages.
import os
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.errors import HttpError
import google.auth.exceptions
from googleapiclient.discovery import build

# If modifying these scopes, delete the file token.json.
SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']

creds = None
# The file token.json stores the user's access and refresh tokens, and is
# created automatically when the authorization flow completes for the first
# time.
if os.path.exists('token.json'):
    try:
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
        creds.refresh(Request())
    except google.auth.exceptions.RefreshError as error:
        # If the refresh fails, reset creds to None.
        creds = None
        print(f'An error occurred: {error}')

# If there are no (valid) credentials available, let the user log in.
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            'credentials.json', SCOPES)
        # creds = flow.run_local_server(port=0)
        creds = flow.run_console()
    # Save the credentials for the next run
    with open('token.json', 'w') as token:
        token.write(creds.to_json())
Since we're using Colab, the above code will ask us to follow a link and then paste the authorization code back into our program's input.
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=123456.abcdef.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.readonly&state=woueP4lJfmYWBYbOlfteLpeuNr0Da2&prompt=consent&access_type=offline
Enter the authorization code:
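A caveat, depending on the library version you end up with (this is my assumption, not something from the original run): newer releases of google-auth-oauthlib have dropped flow.run_console() together with the out-of-band flow. If you hit an AttributeError there, the local-server flow should work instead:
# Fallback if run_console() is not available in your google-auth-oauthlib version
creds = flow.run_local_server(port=0)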
Listing and reading our emails using the Gmail API. We paginate until approximately CUT_OFF emails have been downloaded or the end of the message list is reached:
emails = []      # individual raw emails stored
CUT_OFF = 10000  # maximum number of emails to be downloaded

try:
    service = build("gmail", "v1", credentials=creds)
    gmail_messages = service.users().messages()

    has_next_token = True
    next_token = None
    while has_next_token:
        results = gmail_messages.list(userId='me', pageToken=next_token).execute()
        messages = results.get("messages", [])
        if "nextPageToken" in results:
            next_token = results["nextPageToken"]
        else:
            next_token = None
            has_next_token = False
        size_estimate = results["resultSizeEstimate"]
        print(f"next_token {next_token}, size estimate: {size_estimate}")
        for msg in messages:
            msg_dict = gmail_messages.get(userId='me', id=msg['id'], format='raw').execute()
            emails.append(msg_dict)
        if len(emails) > CUT_OFF:
            has_next_token = False
            next_token = None
except HttpError as error:
    print(f'An error occurred: {error}')
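A small optional tweak (my addition, not part of the original notebook): messages.list also accepts a Gmail search query via the q parameter and a page size via maxResults, so you could narrow the corpus to, say, the Promotions category instead of downloading everything:
# Optional: only list promotional emails, up to 500 per page
results = gmail_messages.list(
    userId='me',
    q='category:promotions',
    maxResults=500
).execute()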
Since we're retrieving raw emails, we need to parse them. Using BeautifulSoup we strip out all HTML tags and are left with bare text, which we'll use as our sentence corpus (let's call it "email knowledge").
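The remove_tags helper used in the snippet below isn't shown elsewhere, so here is a minimal sketch of what I have in mind for it, using BeautifulSoup to strip the tags and keep only the visible text:
from bs4 import BeautifulSoup

def remove_tags(html_content):
    # Parse the HTML part and return only the visible text
    soup = BeautifulSoup(html_content, "html.parser")
    return soup.get_text(separator=" ", strip=True)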
import base64
import email
from email import policy
from email.parser import Parser

parser = Parser(policy=policy.default)

# reference content dictionary
content_dictionary = {}

counter = 0
for raw_email in emails:
    msg_str = base64.urlsafe_b64decode(raw_email['raw'].encode('ASCII'))
    subject = from_email = to_email = None
    msg = parser.parsestr(msg_str.decode('utf-8'))
    for key in msg.keys():
        if key == "Subject":
            subject = msg.get_all("Subject")
        if key == "From":
            from_email = msg.get_all("From")
        if key == "To":
            to_email = msg.get_all("To")
    print(f"from: {from_email}, subject: {subject}")
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            content = part.get_content()
            clear_content = remove_tags(content)
            content_dictionary[counter] = {
                "id": raw_email["id"],
                "subject": subject,
                "from": from_email,
                "to": to_email,
                "body": clear_content,
                "snippet": raw_email["snippet"]
            }
            counter += 1
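One more hedged addition of mine: not every message necessarily contains a text/html part, so some emails may be skipped by the loop above. A small helper that falls back to the text/plain part could look like this:
def extract_body(msg):
    # Prefer the HTML part; fall back to text/plain when no HTML part exists
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            return remove_tags(part.get_content())
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            return part.get_content()
    return ""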
Here's the meat of our code. We load a sentence transformer that will create the "email knowledge" embeddings for us, and later also the question embeddings. In this case I'm using all-MiniLM-L6-v2, which is a small model, but it should do the job. You can try replacing it with a bigger model such as all-mpnet-base-v2 and see if there is any improvement in the results.
from sentence_transformers import SentenceTransformer, util
import torch
embedder = SentenceTransformer('all-MiniLM-L6-v2')
Create the email knowledge embeddings, which represent our knowledge database in vector format (384 dimensions per email).
corpus = [c["body"] for c in content_dictionary.values()]
print(f"embedding {len(corpus)} sentences")
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
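As a quick sanity check (my addition), the shape of the resulting tensor should be (number of emails, 384) for all-MiniLM-L6-v2:
print(corpus_embeddings.shape)                      # e.g. torch.Size([<number of emails>, 384])
print(embedder.get_sentence_embedding_dimension())  # 384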
We also need to encode our queries with the same "embedder". Each query becomes a simple 384-dimensional vector which is matched against our email knowledge database via similarity search.
# Query sentences:
queries = ['how can I learn about machine learning?', 'the latest news in computer vision']
The crux of this piece of code is util.semantic_search, which performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings.
# Find the closest 2 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(2, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first (and only) query
    for hit in hits:
        id = hit['corpus_id']
        item = content_dictionary[id]
        print(f"query: {query}, subject: {item['subject']}, {item['snippet']} https://mail.google.com/mail/u/2/#all/{item['id']}")
Results:
query: how can I learn about machine learning?, subject: ['Recommended: Linear Algebra for Machine Learning and Data Science'], Ready to learn something new? ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏͏ ͏ ͏ ͏ ͏ https://mail.google.com/mail/u/2/#all/1872331840302f51
query: how can I learn about machine learning?, subject: ['Intro to Coding Online Course - Applications now open!'], Hi Igor Rendulić, Want to learn how to code? Now's your chance to join Code in Place 2023 – a FREE, 6-week online course that covers the fundamentals of computer programming using the Python https://mail.google.com/mail/u/2/#all/187251157c88c4f7
query: the latest news in computer vision, subject: ['GTC 2023 Day 4: Top Finale Moments 🏁'], Hello Igor, This is Satya Mallick from LearnOpenCV.com. Welcome to Day 4 and the last day of Spring GTC 2023 as we end this season's GTC on a high note 🍾 We have summarized the highlights from Day https://mail.google.com/mail/u/2/#all/18711445e87740e9
query: the latest news in computer vision, subject: ['New Tutorial ✍🏼: Advanced Image Editing using InstructPix2Pix and prompts'], Do you like to see magic? Read on. . . https://mail.google.com/mail/u/2/#all/1872855eebb8e591
Conclusion
The results are surprisingly good even with the `all-MiniLM-L6-v2` model. I haven't run any of the larger models since I don't possess the cash for GPUs on Google Colab (the free tier goes by so fast :) ).
But if I could have done it and gotten back some nifty relevant results, then the future of semantic search is here, baby!
It's important not to forget that every answer is very promotional since the email knowledge database
is a set of unwanted "spammy" emails.
The full Colab code
Resources
- https://www.sbert.net/examples/applications/semantic-search/README.html
- https://www.sbert.net/docs/pretrained_models.html
A few more QA examples
Q: where is the best place to buy investment real estate property?
Q: what was the top dividend stock in 2022?
Q: the most healthy lunch
Q: Give me graphic design inspiration
Q: news on Ukraine 2023
Q: what to do when someone hacks your computer?
Q: how much does an iPhone cost?