Day 2 – Embeddings & Vector Databases #


1. Why Embeddings? #

We begin with the core problem of representing diverse data types. Images, text, audio, and structured data all need to be compared, retrieved, and clustered. Embeddings map these into a shared vector space where similarity can be computed numerically.

→ how can we measure and preserve semantic meaning across different data types?

2. Mapping Data to Vector Space #

Embeddings reduce dimensionality while preserving meaning. Just as latitude and longitude embed Earth's surface into 2D coordinates, BERT embeds text into a 768-dimensional space, where distances reflect semantic similarity.
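
A minimal sketch of this idea, assuming the sentence-transformers library (the model name all-MiniLM-L6-v2 is an illustrative choice; it produces 384-dimensional vectors rather than BERT's 768):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

sentences = ["A cat sits on the mat.",
             "A kitten rests on a rug.",
             "Stock prices fell sharply."]
vecs = model.encode(sentences, normalize_embeddings=True)

# Nearby vectors = similar meaning.
print(util.cos_sim(vecs[0], vecs[1]))  # high: same idea, different words
print(util.cos_sim(vecs[0], vecs[2]))  # low: unrelated topics
```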

→ how do different embedding models impact representation fidelity and downstream performance?

3. Key Applications #

Embeddings power:

  • Search (e.g., RAG, internet-scale)
  • Recommendations
  • Fraud detection
  • Multimodal integration (e.g., text + image)

→ how do we design joint embeddings for multi-modal tasks?

4. Quality Metrics #

Evaluation focuses on how well embeddings retrieve similar items:

  • Precision@k: Are top results relevant?
  • Recall@k: Do we get all relevant items?
  • nDCG: Are the most relevant ranked highest?
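
All three metrics are simple to compute directly; a sketch (the document IDs are made up for illustration):

```python
import numpy as np

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items appearing in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """relevance: dict of item id -> graded relevance score."""
    dcg = sum(relevance.get(d, 0) / np.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2"]           # hypothetical ranking
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3))  # 0.33
print(recall_at_k(retrieved, relevant, 3))     # 0.5
```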

→ how can evaluation help us improve and select embedding models for specific applications?

5. Retrieval Pipeline #

A standard setup involves:

  • Embedding documents and queries via a dual encoder
  • Storing doc embeddings in a vector DB (e.g., Faiss)
  • At query time, embedding the question and retrieving nearest neighbors
  • Feeding results into an LLM for synthesis
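
A minimal end-to-end sketch of this setup, assuming sentence-transformers for the encoder and faiss-cpu for the index (a single shared encoder stands in for the two towers of a dual encoder; the LLM call itself is left as a placeholder):

```python
# pip install faiss-cpu sentence-transformers
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
docs = ["Faiss is a library for vector similarity search.",
        "BERT embeds text into 768D space.",
        "Paris is the capital of France."]

# Embed documents and store them; inner product on normalized vectors = cosine.
doc_vecs = encoder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(int(doc_vecs.shape[1]))
index.add(doc_vecs)

# At query time: embed the question, retrieve nearest neighbors.
question = "Which library does vector search?"
q_vec = encoder.encode([question], normalize_embeddings=True)
scores, ids = index.search(q_vec, 2)
context = [docs[i] for i in ids[0]]

# Feed retrieved context into an LLM prompt for synthesis (LLM call omitted).
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```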

→ how does the embedding model choice impact the quality of LLM-augmented answers?

6. Operational Considerations #

Embedding models keep improving (e.g., scores on the BEIR benchmark rising from 10.6 to 55.7 across model generations). Choose platforms that:

  • Abstract away model versioning
  • Enable easy re-evaluation
  • Provide upgrade paths (e.g., Vertex AI APIs)

→ how do we future-proof embedding systems in production?

7. Text Embedding Lifecycle #

From raw strings to embedded vectors:

  • Tokenization → Token IDs → Optional one-hot encoding → Dense embeddings
  • Traditional one-hot encoding lacks semantics; dense embeddings retain contextual meaning
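
The lifecycle above is easy to see in practice; a sketch assuming the Hugging Face transformers tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tok("Embeddings retain meaning.")["input_ids"]  # raw string -> token IDs
print(tok.convert_ids_to_tokens(ids))  # e.g., ['[CLS]', 'em', '##bed', ...]

# Each ID indexes a row of the model's embedding matrix (equivalent to
# multiplying a one-hot vector by that matrix), yielding a dense vector.
```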

→ how does token context influence the quality of embeddings?

8. Word Embeddings: GloVe, Word2Vec, SWIVEL #

Early methods:

  • Word2Vec (CBOW, Skip-Gram): Context windows define meaning
  • GloVe: Combines global + local word statistics using matrix factorization
  • SWIVEL: Fast training, handles rare terms, parallelizable
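
A quick Word2Vec sketch using gensim (the tiny corpus is illustrative; real training needs far more text):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]

# sg=1 selects Skip-Gram, sg=0 selects CBOW; window sets the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)              # (50,) static vector per word
print(model.wv.similarity("cat", "dog"))  # context-window-driven similarity
```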

→ are static embeddings enough, or do we need context-aware representations?

9. Shallow and Deep Models #

Two major paradigms:

  • BoW Models (TF-IDF, LSA, LDA): Sparse, easy to compute but lack context
  • Doc2Vec: Introduces a learned paragraph vector
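
The sparse-versus-dense contrast is visible with scikit-learn's TF-IDF: two documents score as similar only if they share exact terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["embeddings map text to vectors",
        "vector databases index embeddings",
        "the weather is sunny today"]

X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
print(cosine_similarity(X[0], X[1]))       # nonzero: shared term "embeddings"
print(cosine_similarity(X[0], X[2]))       # zero: no term overlap, no semantics
```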

→ how do we encode long-range relationships and context in documents?

10. BERT and Beyond #

BERT revolutionized document embedding with:

  • Deep bi-directional transformers
  • Pretraining on masked tokens
  • Next-sentence prediction

It powers models like Sentence-BERT, SimCSE, E5, and now Gemini-based embeddings.
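
Sentence-BERT and its successors add pooling and fine-tuning on top; the basic step of turning BERT token outputs into one sentence vector can be sketched with mean pooling (a simplification of what Sentence-BERT does, assuming Hugging Face transformers and PyTorch):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["Deep bidirectional context.", "Another sentence."],
            padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, tokens, 768)

# Mean-pool token vectors, ignoring padding, to get one 768-d sentence vector.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_vecs = (hidden * mask).sum(1) / mask.sum(1)
print(sentence_vecs.shape)                     # torch.Size([2, 768])
```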

→ what’s the trade-off between compute cost and performance in deep embeddings?

11. Images and Multimodal Representations #

  • Image embeddings from CNNs or ViTs (e.g., EfficientNet)
  • Multimodal models (e.g., ColPali) map text + image into a shared space
  • Enables querying images via text without OCR
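
ColPali needs specialized tooling, but the shared text-image space idea can be sketched with CLIP, a widely available multimodal model (the image path is hypothetical):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("invoice.png")  # hypothetical local image
texts = ["a scanned invoice", "a photo of a dog"]

# Text and image are embedded into the same space and compared directly,
# with no OCR involved.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)                       # match probability per caption
```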

→ what are the infrastructure needs to support scalable multimodal embedding workflows?

12. Embeddings for Structured Data #

  • Use dimensionality reduction (e.g., PCA) or learned embeddings
  • Enable anomaly detection or classification with fewer labeled examples
  • Especially useful when labeled data is scarce
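
A sketch with scikit-learn's PCA on stand-in tabular data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(1000, 50)  # stand-in for 1,000 rows of 50 numeric features

pca = PCA(n_components=8)
Z = pca.fit_transform(X)       # compressed 8-d embeddings per row

# Z can feed anomaly detectors or classifiers that need few labeled examples.
print(Z.shape, pca.explained_variance_ratio_.sum())
```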

→ how can we compress structured data while retaining signal?

13. User-Item & Graph Embeddings #

  • Embed users and items into the same space for recommender systems
  • Graph embeddings (e.g., Node2Vec, DeepWalk) capture node relationships
  • Useful for classification, clustering, and link prediction
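
The core DeepWalk idea fits in a few lines: run random walks over the graph and feed them to Word2Vec as if they were sentences (the toy graph is made up):

```python
import random
from gensim.models import Word2Vec

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def random_walk(start, length=5):
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Walks play the role of sentences: co-visited nodes get nearby vectors.
walks = [random_walk(node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, epochs=20)
print(model.wv.similarity("a", "b"))
```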

→ how do we preserve both entity and relational meaning in embeddings?

14. Dual Encoder & Contrastive Loss #

  • Most embeddings today use dual encoders (e.g., query/doc or text/image towers)
  • Trained with contrastive loss to pull positives close, push negatives away
  • Often initialized from large foundation models (e.g., BERT, Gemini)
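
The standard in-batch-negatives contrastive loss is compact in PyTorch; a sketch with random vectors standing in for the two towers' outputs:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Each query's paired doc is the positive; every other doc in the
    batch serves as a negative (in-batch negatives)."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    logits = q @ d.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(len(q))   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss)
```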

→ how do we balance generalization vs. task-specific fine-tuning?

15. Vector Search #

  • Keyword search fails for synonyms and semantic variants
  • Vector search embeds documents and queries, enabling “meaning-based” retrieval
  • Similarity measured via cosine similarity, dot product, or Euclidean distance
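
The three measures in NumPy; note that on unit-normalized vectors they all induce the same ranking:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.9])

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

# On unit vectors: cosine == dot, and euclidean^2 = 2 - 2*cosine,
# so all three order neighbors identically.
print(dot, cosine, euclidean)
```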

→ what metric and database architecture optimize for your use case?

16. Efficient Nearest Neighbor Techniques #

  • Brute-force search is O(N) per query, which is not viable at scale
  • LSH hashes similar vectors into the same bucket
  • Tree-based methods (KD-tree, Ball-tree) work for low dimensions
  • HNSW and ScaNN handle large-scale, high-dimensional spaces efficiently
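
A small HNSW example, assuming the hnswlib package:

```python
# pip install hnswlib
import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.randn(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build-time quality
index.add_items(data, np.arange(n))

index.set_ef(50)  # query-time knob: higher ef = better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)
print(labels)     # approximate, not exact, nearest neighbors
```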

→ how can we trade off speed vs. accuracy using ANN techniques?

17. What Are Vector Databases? #

  • Built specifically to index and search embeddings
  • Combine ANN search (e.g., ScaNN, HNSW) with metadata filtering
  • Support hybrid search (semantic + keyword) with pre- and post-filtering
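
Pre- versus post-filtering is easiest to see in a toy brute-force version (real vector databases push the filter into the ANN index itself):

```python
import numpy as np

def search(q_vec, doc_vecs, metadata, lang, k=3, pre_filter=True):
    """Toy hybrid search: vector similarity plus a metadata filter."""
    if pre_filter:
        # Pre-filter: restrict the candidate set before the similarity search.
        keep = [i for i, m in enumerate(metadata) if m["lang"] == lang]
        sims = doc_vecs[keep] @ q_vec
        return [keep[i] for i in np.argsort(-sims)[:k]]
    # Post-filter: search everything, then drop non-matching results
    # (risks returning fewer than k items).
    ranked = np.argsort(-(doc_vecs @ q_vec))
    return [i for i in ranked if metadata[i]["lang"] == lang][:k]
```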

→ how do we ensure low-latency, high-recall vector search at scale?

18. Operational Considerations #

  • Embeddings evolve over time; re-embedding a corpus after a model upgrade can be costly
  • Combine vector + keyword search for literal queries (e.g., IDs)
  • Choose vector DBs based on workload (e.g., AlloyDB for OLTP, BigQuery for OLAP)

→ how do we manage model/version drift and storage efficiency?

19. Core Use Cases #

Embeddings power:

  • Search & retrieval
  • Semantic similarity & deduplication
  • Recommendations
  • Clustering & anomaly detection
  • Few-shot classification
  • Retrieval Augmented Generation (RAG)

→ how do embeddings improve relevance and trust in LLM outputs?

20. Retrieval-Augmented Generation (RAG) #

  • RAG improves factual grounding and reduces hallucinations
  • Retrieves documents → augments prompt → generates answer
  • Return sources for transparency and to enable human or LLM coherence checks
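
Putting the three steps together, reusing the encoder/index sketch from section 5 (the llm argument is a placeholder for any generation call):

```python
def answer_with_rag(question, encoder, index, docs, llm, k=3):
    # 1. Retrieve: embed the question, fetch the nearest documents.
    q_vec = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(q_vec, k)
    sources = [docs[i] for i in ids[0]]

    # 2. Augment: ground the prompt in the retrieved text.
    prompt = ("Answer using only the context below. Cite source numbers.\n"
              + "\n".join(f"[{i}] {s}" for i, s in enumerate(sources))
              + f"\n\nQuestion: {question}")

    # 3. Generate, returning sources for transparency and coherence checks.
    return llm(prompt), sources
```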

→ how do we design RAG workflows for auditability and safety?

21. Key Takeaways #

  • Choose models and vector stores based on data, latency, cost, and security needs
  • Use ScaNN or HNSW for scalable ANN
  • Use hybrid filtering to improve search accuracy
  • RAG is critical for grounded LLMs

Closing insight: Embeddings + ANN + RAG form the foundation of trustworthy, scalable, semantic applications.