
Generate product vectors

Before we start generating embeddings, we need to initialize two new columns in our dataset:

  • titan_embedding: to store the embedding vectors
  • token_count: to store the token count for each product title
Skip this step if using the pre-embedded dataset

Skip this step if you're using the pre-embedded dataset, since it already has these two columns.

Run this snippet in a new Jupyter Notebook cell:

Jupyter Notebook
# Initialize columns to store embeddings and token counts
df_shoes['titan_embedding'] = None # Placeholder for embedding vectors
df_shoes['token_count'] = None # Placeholder for token counts

Next, define a function that generates embeddings and applies them to the dataset:

Jupyter Notebook
# Standard-library imports (tqdm and the bedrock_runtime client are set up in earlier cells)
import base64
import json
import os

# Main function to generate image and text embeddings
def generate_embeddings(df, image_col='image', text_col='product_title', embedding_col='embedding', image_folder=None):

    if image_folder is None:
        raise ValueError("You must specify an image folder path.")

    for index, row in tqdm(df.iterrows(), total=df.shape[0], desc="Generating embeddings"):
        try:
            # Prepare image file as base64
            image_path = os.path.join(image_folder, row[image_col])
            with open(image_path, 'rb') as img_file:
                image_base64 = base64.b64encode(img_file.read()).decode('utf-8')

            # Create input data for the model
            input_data = {"inputImage": image_base64, "inputText": row[text_col]}

            # Invoke the Amazon Titan model via the Bedrock runtime
            response = bedrock_runtime.invoke_model(
                body=json.dumps(input_data),
                modelId="amazon.titan-embed-image-v1",
                accept="application/json",
                contentType="application/json"
            )
            response_body = json.loads(response.get("body").read())

            # Extract embedding and token count from the response
            embedding = response_body.get("embedding")
            token_count = response_body.get("inputTextTokenCount")

            # Validate and save the embedding
            if isinstance(embedding, list):
                df.at[index, embedding_col] = embedding  # Save embedding as a list
                df.at[index, 'token_count'] = int(token_count)  # Save token count as an integer
            else:
                raise ValueError("Embedding is not a list as expected.")

        except Exception as e:
            print(f"Error for row {index}: {e}")
            df.at[index, embedding_col] = None  # Handle errors gracefully

    return df

What the function does

The function generate_embeddings takes a Pandas DataFrame (df) and generates embeddings (numerical representations) for each image and product text, then saves these embeddings and token count back into the DataFrame.

Parameters

  • df: The DataFrame containing the data
  • image_col: Column name in the DataFrame where image filenames are stored (default: 'image')
  • text_col: Column name in the DataFrame where text data (like product titles) is stored (default: 'product_title')
  • embedding_col: Column name where embeddings will be stored (default: 'embedding')
  • image_folder: The folder path where images are located (must be provided)

When you've added this function, run the next cell to start generating embeddings.

Running this function will call Amazon Bedrock and incur a cost

Running this next function will start calling Amazon Bedrock and incur a total cost of approximately $0.09.

This whole process takes around 10 minutes.
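As a rough sanity check on that figure, a back-of-the-envelope estimate lines up with the $0.09 total. The per-unit prices and the average title length below are assumptions; verify them against current Amazon Bedrock pricing for Titan Multimodal Embeddings before relying on this estimate:

```python
# Rough cost sketch (assumed prices; check current Amazon Bedrock pricing):
# ~$0.00006 per input image, ~$0.0008 per 1,000 input text tokens
num_products = 1306
image_cost = num_products * 0.00006                   # cost for the images
text_cost = (num_products * 10 / 1000) * 0.0008       # assuming ~10 tokens per title
print(f"Estimated total: ${image_cost + text_cost:.2f}")
```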

Run this cell to start generating embeddings:

Jupyter Notebook
# Generate embeddings for the product data
df_shoes = generate_embeddings(
    df=df_shoes,
    embedding_col='titan_embedding',
    image_folder='data/footwear'
)

This will start a progress bar showing the status of the run. It takes approximately 10 minutes to generate embeddings for the 1,306 pairs of shoes in our shoe database.

The vectorization of 1,306 shoes takes approximately 10 minutes
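Once the run completes, it's worth a quick sanity check that no rows failed silently and that every embedding has the expected 1,024 dimensions. The two-row DataFrame below is a hypothetical stand-in for df_shoes; the same checks apply to the real data:

```python
import pandas as pd

# Hypothetical sample standing in for df_shoes after embedding generation
df_shoes = pd.DataFrame({
    'product_title': ['Trail Runner', 'City Sneaker'],
    'titan_embedding': [[0.1] * 1024, [0.2] * 1024],
    'token_count': [4, 3],
})

# Rows where embedding generation failed are left as None by the function above
failed = df_shoes['titan_embedding'].isna().sum()
print(f"Rows without embeddings: {failed}")

# Titan multimodal embeddings default to 1,024 dimensions
dims = df_shoes['titan_embedding'].dropna().apply(len).unique()
print(f"Embedding dimensions found: {dims}")
```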

Once all shoes are processed, make sure to save the dataset for reuse, so you don't have to generate these embeddings again:

Jupyter Notebook
# Save the dataset with generated embeddings to a new CSV file
# Get today's date in YYYY_MM_DD format
today = datetime.now().strftime('%Y_%m_%d')

# Save the dataset with generated embeddings to a CSV file
df_shoes.to_csv(f'shoes_with_embeddings_token_{today}.csv', index=False)
print(f"Dataset with embeddings saved as 'shoes_with_embeddings_token_{today}.csv'")

Running this cell saves a CSV file containing all the product data, along with the 1,024-dimensional embedding vector and the token_count for each product.
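One caveat when reusing the file: to_csv serializes each embedding list as a string, so after reloading you'll need to parse the column back into lists. A minimal round-trip sketch, using a hypothetical one-row DataFrame and a throwaway filename:

```python
import ast
import pandas as pd

# Hypothetical one-row stand-in for the saved dataset
df_saved = pd.DataFrame({
    'product_title': ['Trail Runner'],
    'titan_embedding': [[0.1, 0.2, 0.3]],
    'token_count': [4],
})
df_saved.to_csv('shoes_demo.csv', index=False)

# CSV storage turns the lists into strings like "[0.1, 0.2, 0.3]",
# so parse them back with ast.literal_eval on reload
df_loaded = pd.read_csv('shoes_demo.csv')
df_loaded['titan_embedding'] = df_loaded['titan_embedding'].apply(ast.literal_eval)
print(type(df_loaded['titan_embedding'].iloc[0]))  # lists again, not strings
```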

Now that we have the product vectors, in the next lesson we'll prepare the data before upserting it to Pinecone, a cloud-based vector database.