Welcome back, future multimodal AI maestros! In our previous chapter, we explored the exciting world of multimodal AI and its incredible potential. Now, it’s time to dive deeper and understand the fundamental step that makes all this magic possible: transforming the messy, diverse “real world” data into a language our AI models can understand.

This chapter is all about representing reality. We’ll learn how raw inputs like text, images, audio, and video, which seem so different to us, are converted into a common, numerical format called embeddings. Think of it as teaching your AI system to “see,” “hear,” and “read” by giving it a universal dictionary of meaning. Mastering this concept is crucial, as it forms the bedrock for any multimodal system you’ll ever build.

By the end of this chapter, you’ll understand:

  • Why numerical representations are essential for AI.
  • What embeddings are and why they’re so powerful.
  • How different types of raw data (text, images, audio, video) are encoded into embeddings using state-of-the-art deep learning techniques.
  • The conceptual pathway for encoding audio and video data, which we’ll cover more briefly.

Ready to turn pixels, words, and sounds into intelligent numbers? Let’s get started!

The Foundation: From Raw Data to Numerical Language

Imagine you want to teach a child what a “cat” is. You show them pictures, make cat sounds, say the word “cat,” and maybe even show them a video of a cat playing. As humans, our brains effortlessly combine these different sensory inputs to form a rich understanding.

But computers don’t have eyes or ears in the same way. They operate on numbers. A raw image is just a grid of pixel values (numbers representing color and intensity). A raw audio file is a sequence of amplitude values over time. Text is a string of characters. How do we bridge this gap?
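To make this concrete, here is a tiny sketch of each modality as raw numbers (the 2x2 "image" and the 440 Hz tone are invented purely for illustration):

```python
import numpy as np

# Text: characters are stored as numeric codes (here, Unicode code points).
text = "cat"
char_codes = [ord(c) for c in text]
print(char_codes)  # [99, 97, 116]

# Image: a tiny 2x2 grayscale "image" is just a grid of intensity values (0-255).
image = np.array([[0, 255],
                  [128, 64]], dtype=np.uint8)
print(image.shape)  # (2, 2)

# Audio: one second of a 440 Hz sine tone sampled at 8 kHz is a sequence
# of 8000 amplitude values over time.
sample_rate = 8000
t = np.linspace(0, 1, sample_rate, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)
print(waveform.shape)  # (8000,)
```

Each of these representations is numbers all the way down, but none of them yet captures meaning. That is the job of embeddings.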

The answer lies in transforming this raw, high-dimensional, and often noisy data into a structured, numerical format that captures its underlying meaning. This is where embeddings come in.

What are Embeddings? The AI’s Secret Language

At their core, embeddings are dense, low-dimensional numerical vectors that represent objects (like words, images, or even entire sentences or concepts) in a continuous vector space.

Think of it like this:

  • If you wanted to describe a person, you wouldn’t just list their raw DNA sequence. You’d use descriptive attributes like height, age, interests, and personality traits.
  • Similarly, an embedding for a word like “king” isn’t just a random number; it’s a vector where each dimension might implicitly represent a semantic feature like “royalty,” “male,” “adult,” etc. Words with similar meanings will have embeddings that are “close” to each other in this vector space.

Why are embeddings so powerful?

  1. Semantic Meaning: They capture the meaning and relationships between data points. “King” and “queen” will be closer than “king” and “apple.”
  2. Dimensionality Reduction: Raw data (e.g., a high-resolution image) can have millions of dimensions. Embeddings compress this information into a much smaller, manageable vector (e.g., 512 or 768 dimensions) while retaining crucial information.
  3. Input for AI Models: Deep learning models, especially neural networks, thrive on numerical inputs. Embeddings provide this structured, meaningful input.
  4. Transfer Learning: Pre-trained embedding models can be leveraged for new tasks, saving immense training time and data.
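To see points 1 and 2 in action, here is a minimal sketch using hand-crafted toy vectors. The four "dimensions" are invented for illustration; real embeddings are learned from data and have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = pointing the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors with made-up dimensions: [royalty, male, female, food].
king  = np.array([0.9, 0.8, 0.1, 0.0])
queen = np.array([0.9, 0.1, 0.8, 0.0])
apple = np.array([0.0, 0.0, 0.0, 0.9])

print(cosine_similarity(king, queen))  # high: related concepts
print(cosine_similarity(king, apple))  # zero here: unrelated concepts
```

Cosine similarity is the standard way to measure "closeness" in an embedding space, and we'll rely on it repeatedly in later chapters.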

Let’s visualize this process conceptually:

flowchart LR
    subgraph Raw_Data["Raw Multimodal Data"]
        A[Text Input]
        B[Image Input]
        C[Audio Input]
        D[Video Input]
    end
    subgraph Modality_Encoders["Modality-Specific Encoders"]
        E[Text Encoder]
        F[Image Encoder]
        G[Audio Encoder]
        H[Video Encoder]
    end
    subgraph Embeddings_Space["Numerical Embeddings"]
        I[Text Embedding]
        J[Image Embedding]
        K[Audio Embedding]
        L[Video Embedding]
    end
    A --> E
    B --> F
    C --> G
    D --> H
    E --> I
    F --> J
    G --> K
    H --> L
    I --> M[Common Multimodal Space]
    J --> M
    K --> M
    L --> M

Figure 2.1: The journey from raw multimodal data to a common embedding space.

As you can see, each raw data type first goes through a specialized encoder designed for its modality. These encoders are typically deep neural networks, often pre-trained on massive datasets. The output of each encoder is a dense vector – our embedding! Ultimately, these embeddings can then be brought together into a “Common Multimodal Space,” which we’ll explore more in later chapters.

Modality-Specific Encoders: The Specialists

Each data type requires a different approach to extract meaningful features. Let’s look at the “specialists” – the encoders for each modality.

Text Encoders: Understanding Language

Text is perhaps the most familiar data type for AI. Before embeddings, words were often represented by one-hot encodings, which are sparse and don’t capture relationships. Modern text encoders are far more sophisticated.

  1. Tokenization: The first step is to break down raw text into smaller units called “tokens.” These can be words, subwords (like “un-” or “-ing”), or characters.
    • Example: “Multimodal AI” -> [“Multi”, “##modal”, “AI”]
  2. Word Embeddings (Historical Context): Earlier techniques like Word2Vec and GloVe learned fixed-size embeddings for each word by analyzing its context in large text corpora.
  3. Transformer-based Encoders (The Modern Powerhouse): Today, the dominant approach to text encoding uses Transformer models. Encoder models like BERT, RoBERTa, and T5, along with embedding models derived from recent large language models (LLMs) such as Llama 3, Mistral, and Gemma, are exceptionally good at generating contextualized embeddings.
    • These models don’t just assign a fixed vector to a word; they generate an embedding for each word (or token) that changes based on the surrounding words in the sentence. This allows them to capture nuances like “bank” (river bank) vs. “bank” (financial institution).
    • When you feed a sentence into a Transformer encoder, it outputs a sequence of vectors, one for each input token. To get a single “sentence embedding,” these token embeddings are often averaged or a special [CLS] token’s embedding is used.
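The subword tokenization in step 1 can be sketched as a toy greedy longest-match tokenizer, which is the core idea behind WordPiece (used by BERT). The vocabulary below is hand-picked for illustration; real tokenizers learn vocabularies of tens of thousands of subwords from data:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization, WordPiece-style.

    A toy sketch: at each position, take the longest vocabulary entry that
    matches, marking non-initial pieces with the "##" continuation prefix.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched: unknown token
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"multi", "##modal", "ai", "play", "##ing"}
print(wordpiece_tokenize("multimodal", toy_vocab))  # ['multi', '##modal']
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
```

Breaking words into reusable subwords is what lets a tokenizer with a fixed vocabulary handle rare and novel words gracefully.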

Image Encoders: Seeing the World

Images are grids of pixels. To an AI, a cat image is just a massive array of numbers. Image encoders must extract features like edges, textures, shapes, and eventually, high-level concepts like “cat” or “sky.”

  1. Convolutional Neural Networks (CNNs): For a long time, CNNs (e.g., ResNet, EfficientNet) were the go-to for image processing. They use convolutional layers to progressively learn hierarchical features, from simple patterns to complex objects.
  2. Vision Transformers (ViT): Inspired by the success of Transformers in NLP, Vision Transformers (ViT) have revolutionized image encoding. They treat an image as a sequence of small “patches” (like tokens in text) and process them using self-attention mechanisms.
    • Models like Google’s ViT and Meta’s DINOv2 are powerful examples.
  3. Contrastive Learning (e.g., CLIP): A very important development for multimodal AI is models trained with contrastive learning, such as OpenAI’s CLIP (Contrastive Language-Image Pre-training). CLIP learns to encode images and text into the same embedding space by predicting which text captions go with which images. This creates a bridge between vision and language from the ground up!
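The contrastive idea can be sketched as follows. This is not CLIP's actual implementation, just a NumPy illustration of its symmetric cross-entropy objective, with random vectors standing in for encoder outputs (the batch size, dimension, and temperature are illustrative):

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    Row i of each matrix describes the same pair, so the i-th image should
    score highest against the i-th text, and vice versa.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities

    # Cross-entropy where the "correct class" for row i is column i,
    # applied in both directions (image->text and text->image).
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
loss_aligned = clip_style_loss(aligned, aligned)  # perfectly matched pairs
loss_random = clip_style_loss(aligned, rng.normal(size=(4, 8)))  # mismatched
print(loss_aligned, loss_random)  # aligned pairs yield the lower loss
```

Minimizing this loss pulls matching image/text pairs together in the shared space while pushing mismatched pairs apart.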

Audio Encoders: Hearing the Sounds

Audio data consists of sound waves, typically represented as a waveform (amplitude over time). Encoding audio involves transforming these raw waves into features that represent speech, music, or environmental sounds.

  1. Feature Extraction: Raw audio is often preprocessed into spectrograms (visual representations of frequency over time) or Mel-frequency Cepstral Coefficients (MFCCs), which are more perceptually relevant.
  2. Recurrent Neural Networks (RNNs) and CNNs: Historically, RNNs (like LSTMs) and 1D CNNs were used to process sequential audio data or spectrograms.
  3. Transformer-based Audio Models: Similar to text and images, Transformers have made huge strides in audio. Models like Wav2Vec 2.0 and HuBERT learn rich contextual representations directly from raw audio waveforms, often through self-supervised pre-training. They excel at tasks like speech recognition and audio classification.
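The spectrogram preprocessing in step 1 can be sketched with a hand-rolled short-time Fourier transform. Libraries like librosa and torchaudio provide production-quality versions; the frame and hop sizes below are arbitrary choices for illustration:

```python
import numpy as np

def magnitude_spectrogram(waveform, frame_size=256, hop_size=128):
    """Short-time Fourier transform magnitude: frequency content over time."""
    window = np.hanning(frame_size)  # taper each frame to reduce spectral leakage
    frames = []
    for start in range(0, len(waveform) - frame_size + 1, hop_size):
        frame = waveform[start:start + frame_size] * window
        spectrum = np.fft.rfft(frame)    # one-sided FFT of this frame
        frames.append(np.abs(spectrum))  # keep magnitude, discard phase
    return np.array(frames)  # shape: (num_frames, frame_size // 2 + 1)

# One second of a pure 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(tone)
print(spec.shape)
# The strongest frequency bin should sit near 440 Hz.
peak_bin = int(spec.mean(axis=0).argmax())
print(peak_bin * sr / 256)  # convert bin index to Hz
```

Turning a 1D waveform into this 2D time-frequency grid is what lets image-style (CNN) or patch-style (Transformer) encoders operate on audio.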

Video Encoders: Understanding Motion and Context

Video is the most complex modality, essentially a sequence of images combined with audio. Encoding video involves capturing both spatial information (what’s in each frame) and temporal information (how things change over time).

  1. Frame-level Processing + Aggregation: One common approach is to process individual frames using an image encoder (like a CNN or ViT) and then aggregate these frame embeddings over time using RNNs or temporal pooling.
  2. 3D Convolutional Neural Networks (3D CNNs): These networks apply convolutions across both spatial dimensions (width, height) and the temporal dimension (time), allowing them to directly learn features that capture motion.
  3. Video Transformers: Models like VideoMAE (Masked Autoencoders for Video) and MViT (Multiscale Vision Transformers) extend the Transformer architecture to video, processing sequences of video “patches” or frames. They are adept at capturing long-range temporal dependencies.
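Approach 1 above can be sketched in a few lines, with a random linear projection standing in for a real image encoder (shapes and sizes here are invented for illustration):

```python
import numpy as np

def encode_frame(frame, projection):
    """Stand-in for a real image encoder (CNN/ViT): flatten + linear map."""
    return frame.reshape(-1) @ projection

def encode_video(frames, projection):
    """Encode each frame independently, then mean-pool over the time axis."""
    frame_embeddings = np.stack([encode_frame(f, projection) for f in frames])
    return frame_embeddings.mean(axis=0)  # temporal average pooling

rng = np.random.default_rng(42)
num_frames, height, width = 16, 8, 8
video = rng.normal(size=(num_frames, height, width))  # toy grayscale clip
projection = rng.normal(size=(height * width, 32))    # stand-in "encoder" weights

video_embedding = encode_video(video, projection)
print(video_embedding.shape)  # (32,)
```

Mean pooling is the simplest aggregation and deliberately discards ordering; 3D CNNs and video Transformers exist precisely to model the temporal structure that this average throws away.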

The Bridge: Learning a Common Multimodal Space

While each encoder specializes in its own modality, the ultimate goal in multimodal AI is often to bring these different types of embeddings into a common, shared embedding space. Why? Because in this space, an image of a cat, the word “cat,” and the sound of a “meow” would all be represented by vectors that are close to each other.

This shared space allows AI models to:

  • Perform cross-modal retrieval (e.g., find images based on a text query).
  • Generate one modality from another (e.g., generate text descriptions from images).
  • Reason about concepts that span multiple modalities.

Techniques like contrastive learning (as seen in CLIP) are particularly effective at training models to project different modalities into such a unified space.
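Cross-modal retrieval in such a shared space reduces to nearest-neighbor search. All vectors below are invented for illustration; in practice they would come from a model like CLIP:

```python
import numpy as np

def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate closest to the query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int((c @ q).argmax())

# Toy shared-space embeddings: the text "a photo of a cat" should land
# nearest to the actual cat image.
text_query = np.array([0.9, 0.1, 0.0])  # "a photo of a cat"
image_embs = np.array([
    [0.1, 0.9, 0.1],   # image 0: a dog
    [0.8, 0.2, 0.1],   # image 1: a cat
    [0.0, 0.1, 0.9],   # image 2: a car
])

print(retrieve(text_query, image_embs))  # 1 (the cat image)
```

This is the mechanism behind "search my photos by text description": embed the query, embed the candidates, rank by similarity.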

Step-by-Step: Getting Embeddings with Python

Let’s get our hands dirty (or rather, our keyboards typing!) and see how we can generate embeddings for text and images using the powerful transformers library from Hugging Face. This library provides easy access to hundreds of state-of-the-art pre-trained models.

For this exercise, we’ll need Python 3.10+ and the transformers, torch, Pillow, and requests libraries. If you don’t have them installed, open your terminal and run:

pip install transformers~=4.38.0 torch~=2.2.0 Pillow~=10.2.0 requests~=2.31.0

Note: As of 2026-03-20, these are recent stable versions. torch (PyTorch) is the deep learning framework that transformers runs on in this example. Ensure you have a compatible version installed. For GPU support, consult the official PyTorch installation guide.

Step 1: Text Embeddings

We’ll use a DistilBERT model, a smaller, faster version of BERT, perfect for demonstration.

Create a new Python file, say embeddings_generator.py, and add the following code:

# embeddings_generator.py

import torch
from transformers import AutoTokenizer, AutoModel

print("--- Generating Text Embedding ---")

# 1. Choose a pre-trained model
# "distilbert-base-uncased" is a good, lightweight general-purpose model.
# For more advanced sentence embeddings, "sentence-transformers/all-MiniLM-L6-v2" is popular.
model_name_text = "distilbert-base-uncased"

# 2. Load the tokenizer and model
# The tokenizer converts text into numerical IDs that the model understands.
# The AutoModel class automatically loads the correct model architecture.
tokenizer = AutoTokenizer.from_pretrained(model_name_text)
model_text = AutoModel.from_pretrained(model_name_text)

# 3. Prepare your text input
text_input = "Multimodal AI is revolutionizing how we interact with technology."
print(f"\nOriginal Text: \"{text_input}\"")

# 4. Tokenize the text and convert to PyTorch tensors
# `return_tensors="pt"` ensures the output is a PyTorch tensor.
# `padding=True` and `truncation=True` are useful for batches of text.
inputs = tokenizer(text_input, return_tensors="pt", padding=True, truncation=True)
print(f"Tokenized Input IDs (first 10): {inputs['input_ids'][0][:10].tolist()}...")
print(f"Attention Mask (first 10): {inputs['attention_mask'][0][:10].tolist()}...")

# 5. Get the embeddings from the model
# `torch.no_grad()` turns off gradient calculation, saving memory and speeding up inference.
with torch.no_grad():
    outputs = model_text(**inputs)

# 6. Extract the final embedding
# For many Transformer encoders, the `last_hidden_state` contains the embeddings for each token.
# To get a single sentence embedding, we often average the token embeddings (mean pooling)
# or use the embedding of the special `[CLS]` token (which is often the first token).
# Here, we'll use mean pooling for simplicity.
text_embedding = outputs.last_hidden_state.mean(dim=1)

print(f"Text Embedding Shape: {text_embedding.shape}")
print(f"Example Text Embedding (first 5 values): {text_embedding[0][:5].tolist()}")

Explanation:

  • AutoTokenizer.from_pretrained(): This loads a pre-trained tokenizer specific to our chosen model. It handles breaking text into tokens and mapping them to numerical IDs.
  • AutoModel.from_pretrained(): This loads the pre-trained neural network model itself.
  • tokenizer(text_input, ...): This is where the magic of tokenization happens. It converts your human-readable text into a dictionary of tensors, including input_ids (the numerical representation of tokens) and attention_mask (which tells the model to ignore padding tokens).
  • with torch.no_grad():: When you’re just using a model for inference (getting predictions or embeddings) and not training it, you don’t need to calculate gradients. This context manager disables gradient computation, which makes the process faster and uses less memory.
  • model_text(**inputs): We pass the tokenized inputs to the model. The ** unpacks the dictionary into keyword arguments.
  • outputs.last_hidden_state.mean(dim=1): The model returns a model-output object that behaves like a dictionary. last_hidden_state is a tensor containing the contextualized embedding for each token in the input sequence. We apply mean(dim=1) to average these token embeddings across the sequence dimension, resulting in a single vector representing the entire sentence. The shape [1, 768] means one embedding vector of 768 dimensions.
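One caveat about the mean pooling above: it is fine for a single unpadded sentence, but in a padded batch a plain mean would average in meaningless padding vectors. A mask-aware version looks like this (shown with NumPy arrays so the example is self-contained; the same arithmetic applies to the PyTorch tensors the model returns):

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Mean-pool token embeddings while ignoring padding positions."""
    mask = attention_mask[..., None]            # (batch, seq, 1): 1 = real token
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1)                   # number of real tokens per example
    return summed / counts

# Toy batch: 1 example, 4 token slots (the last is padding), 3 dimensions.
hidden = np.array([[[1.0, 0.0, 2.0],
                    [3.0, 0.0, 0.0],
                    [2.0, 3.0, 1.0],
                    [9.0, 9.0, 9.0]]])  # padding junk that must be ignored
mask = np.array([[1, 1, 1, 0]])

print(masked_mean_pool(hidden, mask))  # [[2. 1. 1.]]
```

Sentence-embedding models such as sentence-transformers apply exactly this kind of masked pooling internally.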

Step 2: Image Embeddings

Now let’s do the same for an image. We’ll use a ViTModel (Vision Transformer) from Google.

Add the following code to the same embeddings_generator.py file:

# ... (previous text embedding code)

from PIL import Image
import requests
from io import BytesIO
from transformers import ViTImageProcessor, ViTModel

print("\n--- Generating Image Embedding ---")

# 1. Choose a pre-trained Vision Transformer model
# "google/vit-base-patch16-224" is a foundational Vision Transformer.
model_name_image = "google/vit-base-patch16-224"

# 2. Load the image processor and model
# The image processor handles image preprocessing (resizing, normalization).
# Note: ViTFeatureExtractor has been largely superseded by ViTImageProcessor for newer versions.
processor = ViTImageProcessor.from_pretrained(model_name_image)
model_image = ViTModel.from_pretrained(model_name_image)

# 3. Load an example image
# We'll fetch an image from a URL. Replace with a local path if preferred.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_anatomy_vit.png"
print(f"\nLoading image from: {image_url}")

response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail fast if the download did not succeed
image = Image.open(BytesIO(response.content)).convert("RGB")  # ensure 3 channels (drop any alpha)

print(f"Original Image Size: {image.size}")
# image.show() # Uncomment to display the image

# 4. Preprocess the image and convert to PyTorch tensors
# The processor prepares the image in the format expected by the model.
inputs_image = processor(images=image, return_tensors="pt")

# 5. Get the embeddings from the model
with torch.no_grad():
    outputs_image = model_image(**inputs_image)

# 6. Extract the final image embedding
# For ViT, the first token's output (`[:, 0, :]`) often corresponds to the [CLS] token,
# which is designed to represent the entire image.
image_embedding = outputs_image.last_hidden_state[:, 0, :]

print(f"Image Embedding Shape: {image_embedding.shape}")
print(f"Example Image Embedding (first 5 values): {image_embedding[0][:5].tolist()}")

Explanation:

  • ViTImageProcessor.from_pretrained(): This loads the preprocessing logic specific to the Vision Transformer. It handles tasks like resizing the image to the model’s expected input size (e.g., 224x224 pixels) and normalizing pixel values. Note: ViTFeatureExtractor is an older class; ViTImageProcessor is its modern replacement in recent transformers releases.
  • Image.open(BytesIO(response.content)): This is a common way to load an image directly from a URL into a Pillow Image object.
  • processor(images=image, ...): The image is preprocessed into a tensor, ready for the model.
  • outputs_image.last_hidden_state[:, 0, :]: Similar to text Transformers, ViTs output a sequence of embeddings (one for each image patch). The embedding corresponding to the special [CLS] token (which is often at index 0) is typically used as the overall image representation. The shape [1, 768] means one embedding vector of 768 dimensions.

By running this script, you’ll see the shapes and example values of the generated text and image embeddings. Notice how they are both dense numerical vectors, ready to be processed by further AI components!

Mini-Challenge: Explore a Different Model

Now it’s your turn to experiment!

Challenge: Modify the embeddings_generator.py script to do one of the following:

  1. Text: Change the model_name_text to "sentence-transformers/all-MiniLM-L6-v2". This model is specifically designed to produce good sentence embeddings. Observe if the embedding dimension changes.
  2. Image: Find a different pre-trained Vision Transformer model on the Hugging Face Model Hub (e.g., search for vit or beit) and use its ViTImageProcessor and ViTModel to get an image embedding. Pay attention to the model’s expected input size if mentioned.
  3. Your Own Data: Try generating an embedding for a piece of text or an image of your own. For a local image, you can replace the requests part with Image.open("path/to/your/image.jpg").

Hint: The Hugging Face Model Hub (huggingface.co/models) is your best friend for finding pre-trained models and their documentation. Just search for a model and look for its AutoTokenizer/AutoModel or ViTImageProcessor/ViTModel usage examples.

What to Observe/Learn:

  • How different models might produce embeddings of different dimensions.
  • The ease with which you can swap out pre-trained models using the Hugging Face library.
  • The consistency of the process: tokenize/extract features, pass to model, extract embedding.

Common Pitfalls & Troubleshooting

Working with embeddings, especially across modalities, can sometimes throw a curveball. Here are a few common issues and how to approach them:

  1. Preprocessing Mismatches:

    • Problem: Your data isn’t preprocessed exactly as the model expects (e.g., wrong image size, incorrect normalization, inconsistent tokenization).
    • Solution: Always use the AutoTokenizer and AutoImageProcessor (or their specific counterparts like ViTImageProcessor) that come with the specific pre-trained model you are using. These objects are designed to ensure your input matches the model’s training data. Check the model’s documentation on Hugging Face for expected input formats.
  2. Computational Resources:

    • Problem: Generating embeddings for large batches of data or very large models can be memory-intensive, especially on CPU.
    • Solution: If you have a GPU, ensure your PyTorch tensors are moved to the GPU (.to("cuda")) before passing them to the model. For very large models, consider using smaller versions (like DistilBERT instead of BERT), batching your inputs, or leveraging techniques like gradient accumulation during training (though less relevant for pure inference).
  3. Semantic Drift/Domain Mismatch:

    • Problem: The embeddings generated don’t seem to capture the nuances of your specific domain (e.g., medical images, highly specialized scientific text).
    • Solution: Pre-trained models are powerful, but they are trained on general data. For highly specialized domains, you might need to fine-tune the pre-trained encoder on a dataset specific to your domain. This process adapts the model’s weights to better understand your niche.
  4. Incorrect Embedding Extraction:

    • Problem: You’re getting an output from the model, but it’s not the “right” embedding for your task (e.g., using token embeddings instead of a sentence embedding, or a patch embedding instead of a whole-image embedding).
    • Solution: Carefully read the documentation for the specific model you’re using. For Transformers, the [CLS] token’s output (outputs.last_hidden_state[:, 0, :]) or mean pooling of last_hidden_state are common ways to get a sequence-level embedding. For vision models, it’s often the [CLS] token or a pooled output.
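For pitfall 2, batching is often the simplest fix. A minimal chunking helper might look like this (the tokenizer call in the comment mirrors the earlier script and is not executed here; device handling is likewise only sketched):

```python
def batched(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

sentences = [f"example sentence {i}" for i in range(10)]
print([len(b) for b in batched(sentences, 4)])  # [4, 4, 2]

# In the earlier script, each chunk would then be processed as:
#   inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
#   inputs = {k: v.to(device) for k, v in inputs.items()}  # move to GPU if available
#   with torch.no_grad():
#       outputs = model_text(**inputs)
```

Smaller batches trade throughput for a lower peak memory footprint, so the right batch size depends on your hardware.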

Summary

Phew! You’ve just taken a monumental step in understanding multimodal AI. Here’s a quick recap of our journey:

  • The Necessity of Embeddings: AI models need numerical inputs, and raw data is too high-dimensional and unstructured. Embeddings bridge this gap by converting data into dense, meaningful vectors.
  • What are Embeddings? They are numerical representations that capture the semantic meaning and relationships of data points in a low-dimensional vector space.
  • Modality-Specific Encoders: Each data type (text, image, audio, video) requires specialized deep learning architectures (Transformers, CNNs, ViTs, Wav2Vec 2.0, VideoMAE) to effectively extract features and generate embeddings.
  • Common Multimodal Space: The ultimate goal is often to project these modality-specific embeddings into a shared space where different data types can be compared and understood together.
  • Practical Application: We successfully used the Hugging Face transformers library to generate text and image embeddings, demonstrating how accessible this technology is.

You now have a solid grasp of how multimodal AI systems begin to “perceive” the world. In the next chapter, we’ll take these powerful embeddings and explore how to combine them effectively through various data fusion techniques, bringing us even closer to building truly intelligent multimodal systems!
