Generate product vectors
Before we start to generate embeddings, we'll need to initialize two new columns in our dataset:
- titan_embedding: to store the embedding vectors
- token_count: to store the token count for each product title
Skip this step if you're using the pre-embedded dataset, since it already has these two columns.
Run this snippet in a new Jupyter Notebook cell:
# Initialize columns to store embeddings and token counts
df_shoes['titan_embedding'] = None # Placeholder for embedding vectors
df_shoes['token_count'] = None # Placeholder for token counts
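If you're not sure whether your dataset already contains these columns, a quick check like the following (a minimal sketch) will tell you before you proceed:
# Check whether the embedding columns already exist (e.g., in the pre-embedded dataset)
print('titan_embedding' in df_shoes.columns, 'token_count' in df_shoes.columns)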
Next, define a function to generate embeddings and apply it to the dataset:
# Main function to generate image and text embeddings
# These imports may already have been run in an earlier cell;
# bedrock_runtime is the Bedrock client created during setup
import os
import json
import base64
from tqdm import tqdm

def generate_embeddings(df, image_col='image', text_col='product_title', embedding_col='embedding', image_folder=None):
    if image_folder is None:
        raise ValueError("You must specify an image folder path.")

    for index, row in tqdm(df.iterrows(), total=df.shape[0], desc="Generating embeddings"):
        try:
            # Prepare image file as base64
            image_path = os.path.join(image_folder, row[image_col])
            with open(image_path, 'rb') as img_file:
                image_base64 = base64.b64encode(img_file.read()).decode('utf-8')

            # Create input data for the model
            input_data = {"inputImage": image_base64, "inputText": row[text_col]}

            # Invoke the Amazon Titan model via the Bedrock runtime
            response = bedrock_runtime.invoke_model(
                body=json.dumps(input_data),
                modelId="amazon.titan-embed-image-v1",
                accept="application/json",
                contentType="application/json"
            )
            response_body = json.loads(response.get("body").read())

            # Extract embedding and token count from the response
            embedding = response_body.get("embedding")
            token_count = response_body.get("inputTextTokenCount")

            # Validate and save the embedding
            if isinstance(embedding, list):
                df.at[index, embedding_col] = embedding  # Save embedding as a list
                df.at[index, 'token_count'] = int(token_count)  # Save token count as an integer
            else:
                raise ValueError("Embedding is not a list as expected.")
        except Exception as e:
            print(f"Error for row {index}: {e}")
            df.at[index, embedding_col] = None  # Mark failed rows so they can be retried later

    return df
What the function does
The generate_embeddings function takes a Pandas DataFrame (df), generates embeddings (numerical representations) for each image and product title, and saves the embeddings and token counts back into the DataFrame.
Parameters
- df: The DataFrame containing the data
- image_col: Column name in the DataFrame where image filenames are stored (default: 'image')
- text_col: Column name in the DataFrame where text data (like product titles) is stored (default: 'product_title')
- embedding_col: Column name where embeddings will be stored (default: 'embedding')
- image_folder: The folder path where images are located (must be provided)
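Before running the full loop, you can sanity-check the Bedrock call on a single row. The following is a minimal sketch, assuming the first row's image exists in data/footwear and the bedrock_runtime client from the earlier setup is available:
# Sanity check: embed a single product before processing the whole dataset
row = df_shoes.iloc[0]
with open(os.path.join('data/footwear', row['image']), 'rb') as img_file:
    image_base64 = base64.b64encode(img_file.read()).decode('utf-8')

response = bedrock_runtime.invoke_model(
    body=json.dumps({"inputImage": image_base64, "inputText": row['product_title']}),
    modelId="amazon.titan-embed-image-v1",
    accept="application/json",
    contentType="application/json"
)
body = json.loads(response.get("body").read())
print(len(body.get("embedding")), body.get("inputTextTokenCount"))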
When you've added this function, run the next cell to start generating embeddings.
Running the next cell will start calling Amazon Bedrock and incur a total cost of $0.09.
This whole process takes around 10 minutes.
Run this cell to start generating embeddings:
# Generate embeddings for the product data
df_shoes = generate_embeddings(
df=df_shoes,
embedding_col='titan_embedding',
image_folder='data/footwear'
)
This will display a progress bar so you can follow the process. It takes approximately 10 minutes to generate embeddings for the 1306 pairs of shoes in our shoe database.
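Once the loop finishes, it's worth confirming that every row received a vector. A quick check, sketched here under the assumption that the run completed, could look like this:
# Verify that all rows have embeddings and inspect the vector dimension
print(df_shoes['titan_embedding'].isna().sum(), "rows failed")
print(len(df_shoes['titan_embedding'].iloc[0]))  # Titan embeddings are 1024-dimensional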
Once all shoes are processed, make sure to save the dataset for reuse, so you don't have to generate the embeddings for this dataset again:
from datetime import datetime  # May already be imported in an earlier cell

# Get today's date in YYYY_MM_DD format
today = datetime.now().strftime('%Y_%m_%d')

# Save the dataset with generated embeddings to a CSV file
df_shoes.to_csv(f'shoes_with_embeddings_token_{today}.csv', index=False)
print(f"Dataset with embeddings saved as 'shoes_with_embeddings_token_{today}.csv'")
Running this cell will save a CSV file with all the product data, along with a 1024-dimensional vector and the token_count for each product_title.
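Note that to_csv serializes each embedding list as a string, so when you reload the file you'll need to parse the vectors back into lists. A minimal sketch of how that could look (the filename is an example; use your own):
import ast
import pandas as pd

# Reload the saved dataset and parse the stringified embedding lists back into Python lists
df_shoes = pd.read_csv('shoes_with_embeddings_token_2024_01_01.csv')  # Replace with your actual filename
df_shoes['titan_embedding'] = df_shoes['titan_embedding'].apply(ast.literal_eval)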
Now that we have the product vectors, in the next lesson we'll prepare the data before upserting it to Pinecone, a cloud-based vector database.