Day 2 – Embeddings & Vector Databases #
1. Why Embeddings? #
We begin with the core problem of representing diverse data types. Images, text, audio, and structured data all need to be compared, retrieved, and clustered. Embeddings map these into a shared vector space where similarity can be computed numerically.
→ how can we measure and preserve semantic meaning across different data types?
2. Mapping Data to Vector Space #
Embeddings reduce dimensionality while preserving meaning: just as latitude and longitude embed Earth’s surface into 2D coordinates, BERT embeds text into 768-dimensional space. Distances between vectors represent semantic similarity.
→ how do different embedding models impact representation fidelity and downstream performance?
3. Key Applications #
Embeddings power:
- Search (e.g., RAG, internet-scale)
- Recommendations
- Fraud detection
- Multimodal integration (e.g., text + image)
→ how do we design joint embeddings for multi-modal tasks?
4. Quality Metrics #
Evaluation focuses on how well embeddings retrieve similar items:
- Precision@k: Are top results relevant?
- Recall@k: Do we get all relevant items?
- nDCG: Are the most relevant ranked highest?
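These three metrics can be computed directly from a ranked result list. A minimal sketch (document IDs and relevance labels are made up for illustration):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """Normalized discounted cumulative gain with graded relevance scores."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["d1", "d5", "d3", "d9"]   # ranked search results (hypothetical)
relevant = {"d1", "d3", "d4"}          # ground-truth relevant set
p3 = precision_at_k(retrieved, relevant, 3)   # 2 of the top 3 are relevant
```

Note that precision@k and recall@k treat relevance as binary, while nDCG rewards putting the *most* relevant items first.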
→ how can evaluation help us improve and select embedding models for specific applications?
5. RAG and Semantic Search #
A standard setup involves:
- Embedding documents and queries via a dual encoder
- Storing doc embeddings in a vector DB (e.g., Faiss)
- At query time, embedding the question and retrieving nearest neighbors
- Feeding results into an LLM for synthesis
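The steps above can be sketched end to end. This is an illustrative toy, not a real system: `embed` is a hash-based stand-in for a trained dual encoder (so the ranking carries no real semantics), and brute-force cosine search over a numpy matrix stands in for a vector DB like Faiss:

```python
import hashlib
import numpy as np

def embed(text, dim=8):
    # Stand-in for a trained dual-encoder model: a deterministic
    # pseudo-embedding seeded from a hash of the text (illustrative only).
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def build_index(docs):
    # "Vector DB": a matrix of L2-normalized document embeddings.
    vecs = np.stack([embed(d) for d in docs])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query, docs, index, k=2):
    # Embed the query, then return the k nearest docs by cosine similarity.
    q = embed(query)
    q = q / np.linalg.norm(q)
    top = np.argsort(-(index @ q))[:k]
    return [docs[i] for i in top]

docs = ["intro to embeddings", "vector database guide", "cooking recipes"]
index = build_index(docs)
hits = retrieve("how do vector databases work?", docs, index, k=2)
# `hits` would then be inserted into the LLM prompt for synthesis
```

With a real encoder, the retrieved neighbors would be semantically close to the query; here the point is only the shape of the pipeline.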
→ how does the embedding model choice impact the quality of LLM-augmented answers?
6. Operational Considerations #
Embedding models keep improving (e.g., BEIR benchmark scores climbing from 10.6 to 55.7 across model generations). Choose platforms that:
- Abstract away model versioning
- Enable easy re-evaluation
- Provide upgrade paths (e.g., Vertex AI APIs)
→ how do we future-proof embedding systems in production?
7. Text Embedding Lifecycle #
From raw strings to embedded vectors:
- Tokenization → Token IDs → Optional one-hot encoding → Dense embeddings
- Traditional one-hot encodings lack semantics; dense embeddings retain contextual meaning
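The lifecycle can be shown with a toy whitespace tokenizer and a random lookup table standing in for learned embedding weights (both are assumptions for illustration):

```python
import numpy as np

# Toy vocabulary and whitespace tokenizer (real tokenizers use subwords).
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def tokenize(text):
    return [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]

token_ids = tokenize("The cat sat")

# One-hot: sparse, and every pair of tokens is equally dissimilar.
one_hot = np.eye(len(vocab))[token_ids]          # shape (3, 4)

# Dense embeddings: a learned lookup table (random here for illustration);
# similar tokens end up with similar rows after training.
embedding_table = np.random.default_rng(0).standard_normal((len(vocab), 5))
dense = embedding_table[token_ids]               # shape (3, 5)
```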
→ how does token context influence the quality of embeddings?
8. Word Embeddings: GloVe, Word2Vec, SWIVEL #
Early methods:
- Word2Vec (CBOW, Skip-Gram): Context windows define meaning
- GloVe: Combines global + local word statistics using matrix factorization
- SWIVEL: Fast training, handles rare terms, parallelizable
→ are static embeddings enough, or do we need context-aware representations?
9. Shallow and Deep Models #
Two major paradigms:
- BoW Models (TF-IDF, LSA, LDA): Sparse, easy to compute but lack context
- Doc2Vec: Introduces a learned paragraph vector
→ how do we encode long-range relationships and context in documents?
10. BERT and Beyond #
BERT revolutionized document embedding with:
- Deep bi-directional transformers
- Pretraining on masked tokens
- Next-sentence prediction
It powers models like Sentence-BERT, SimCSE, E5, and now Gemini-based embeddings.
→ what’s the trade-off between compute cost and performance in deep embeddings?
11. Images and Multimodal Representations #
- Image embeddings from CNNs or ViTs (e.g., EfficientNet)
- Multimodal models (e.g., ColPali) map text + image into a shared space
- Enables querying images via text without OCR
→ what are the infrastructure needs to support scalable multimodal embedding workflows?
12. Embeddings for Structured Data #
- Use dimensionality reduction (e.g., PCA) or learned embeddings
- Enable anomaly detection or classification with fewer labeled examples
- Especially useful when labeled data is scarce
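For the dimensionality-reduction route, a minimal PCA sketch via SVD (the data here is random and purely illustrative):

```python
import numpy as np

def pca_embed(X, n_components=2):
    # Center the features, then project onto the top principal components.
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix: rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 10))    # 100 rows of 10 structured features
Z = pca_embed(X, n_components=2)      # compressed 2-D embedding per row
```

The compressed vectors can then feed anomaly detectors or classifiers that need far fewer labeled examples than the raw feature space would.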
→ how can we compress structured data while retaining signal?
13. User-Item & Graph Embeddings #
- Embed users and items into the same space for recommender systems
- Graph embeddings (e.g., Node2Vec, DeepWalk) capture node relationships
- Useful for classification, clustering, and link prediction
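The random-walk idea behind DeepWalk/Node2Vec can be sketched in a few lines: generate walks over the graph, then (not shown) feed the node sequences to a Word2Vec-style model so that co-visited nodes get similar embeddings. The tiny graph below is made up:

```python
import random

def random_walks(graph, walk_len=5, walks_per_node=2, seed=0):
    # DeepWalk-style walk generation: each walk is a "sentence" of nodes.
    rng = random.Random(seed)
    walks = []
    for node in graph:
        for _ in range(walks_per_node):
            walk = [node]
            for _ in range(walk_len - 1):
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
walks = random_walks(graph)   # 3 nodes x 2 walks each = 6 node sequences
```

Node2Vec differs mainly in biasing the walk (breadth-first vs depth-first tendencies) via two extra parameters.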
→ how do we preserve both entity and relational meaning in embeddings?
14. Dual Encoder & Contrastive Loss #
- Most embeddings today use dual encoders (e.g., query/doc or text/image towers)
- Trained with contrastive loss to pull positives close, push negatives away
- Often initialized from large foundation models (e.g., BERT, Gemini)
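The "pull positives close, push negatives away" objective is commonly realized as an in-batch InfoNCE loss. A numpy sketch (batch size, dimension, and temperature are arbitrary choices here):

```python
import numpy as np

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    # In-batch contrastive loss: each query's positive is the doc at the
    # same index; every other doc in the batch serves as a negative.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature             # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy against the diagonal (the matching pairs).
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))
loss_matched = info_nce_loss(q, q)            # perfectly matched pairs
loss_shuffled = info_nce_loss(q, q[::-1])     # mismatched pairs → higher loss
```

Training the two towers minimizes this loss, so matched query/doc pairs end up with the highest similarity in their batch.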
→ how do we balance generalization vs. task-specific fine-tuning?
15. Vector vs. Keyword Search #
- Keyword search fails for synonyms and semantic variants
- Vector search embeds documents and queries, enabling “meaning-based” retrieval
- Similarity measured via cosine similarity, dot product, or Euclidean distance
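The three measures can disagree: cosine similarity ignores vector length, the dot product does not, and Euclidean distance is a distance rather than a similarity. A quick comparison on two vectors pointing the same way:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0: same direction
dot = a @ b                                                # 28.0: magnitude matters
euclidean = np.linalg.norm(a - b)                          # sqrt(14): nonzero distance
```

For unit-normalized embeddings the three orderings coincide, which is why many systems normalize vectors at index time and then use the cheap dot product.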
→ what metric and database architecture optimize for your use case?
16. Efficient Nearest Neighbor Techniques #
- Brute force is O(N) per query, which is not viable at scale
- LSH hashes similar vectors into the same bucket
- Tree-based methods (KD-tree, Ball-tree) work well in low dimensions
- HNSW and ScaNN handle large-scale, high-dimensional spaces efficiently
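The LSH idea is simple to sketch with random hyperplanes: the sign pattern of a vector's projections forms its hash, and vectors pointing in similar directions tend to land in the same bucket. The dimensions and hyperplane count below are arbitrary:

```python
import numpy as np

def lsh_bucket(vec, hyperplanes):
    # Random-hyperplane LSH: which side of each hyperplane the vector
    # falls on gives one hash bit; the bit pattern is the bucket key.
    signs = (hyperplanes @ vec) >= 0
    return tuple(signs.tolist())

rng = np.random.default_rng(1)
hyperplanes = rng.standard_normal((8, 16))   # 8 random hyperplanes in 16-D

v = rng.standard_normal(16)
bucket_v = lsh_bucket(v, hyperplanes)
bucket_scaled = lsh_bucket(2.5 * v, hyperplanes)  # same direction → same bucket
```

Search then only compares the query against vectors sharing its bucket (or a few neighboring buckets), trading a little recall for a large speedup; HNSW and ScaNN make more sophisticated versions of the same trade.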
→ how can we trade off speed vs. accuracy using ANN techniques?
17. What Are Vector Databases? #
- Built specifically to index and search embeddings
- Combine ANN search (e.g., ScaNN, HNSW) with metadata filtering
- Support hybrid search (semantic + keyword) with pre- and post-filtering
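Pre- vs post-filtering can be contrasted on a toy in-memory "index" (vectors, metadata, and the brute-force `search` are all illustrative stand-ins for a real vector DB):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((6, 4))
metadata = ["en", "en", "de", "en", "de", "en"]   # e.g., document language
ids = list(range(6))

def search(query, vecs, doc_ids, k=2):
    # Brute-force similarity search standing in for an ANN index.
    order = np.argsort(-(vecs @ query))[:k]
    return [doc_ids[i] for i in order]

query = rng.standard_normal(4)

# Pre-filtering: restrict the candidate set by metadata, then search it.
keep = [i for i in ids if metadata[i] == "en"]
pre = search(query, vectors[keep], keep, k=2)

# Post-filtering: search everything, then drop non-matching results.
post = [i for i in search(query, vectors, ids, k=4) if metadata[i] == "en"][:2]
```

Pre-filtering guarantees k matching results but can hurt ANN index efficiency; post-filtering keeps the index fast but may return fewer than k results if the filter is selective, which is why real systems over-fetch before filtering.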
→ how do we ensure low-latency, high-recall vector search at scale?
18. Operational Considerations #
- Embeddings evolve over time; re-embedding a corpus after a model change can be costly
- Combine vector + keyword search for literal queries (e.g., IDs)
- Choose vector DBs based on workload (e.g., AlloyDB for OLTP, BigQuery for OLAP)
→ how do we manage model/version drift and storage efficiency?
19. Core Use Cases #
Embeddings power:
- Search & retrieval
- Semantic similarity & deduplication
- Recommendations
- Clustering & anomaly detection
- Few-shot classification
- Retrieval Augmented Generation (RAG)
→ how do embeddings improve relevance and trust in LLM outputs?
20. Retrieval-Augmented Generation (RAG) #
- RAG improves factual grounding and reduces hallucinations
- Retrieves documents → augments prompt → generates answer
- Return sources for transparency and to support human or LLM coherence checks
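The retrieve → augment → generate flow ends with prompt assembly. A minimal sketch of the augmentation step, with hypothetical document IDs and instruction wording:

```python
def build_rag_prompt(question, retrieved_docs):
    # Assemble an augmented prompt: retrieved passages become grounding
    # context, and sources are listed so the answer can be audited.
    context = "\n".join(
        f"[{i + 1}] {doc['text']}" for i, doc in enumerate(retrieved_docs)
    )
    sources = ", ".join(doc["id"] for doc in retrieved_docs)
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"(Sources: {sources})"
    )

docs = [
    {"id": "kb-12", "text": "Embeddings map data into vector space."},
    {"id": "kb-40", "text": "Vector DBs index embeddings for fast ANN search."},
]
prompt = build_rag_prompt("What do vector databases store?", docs)
```

Keeping the source IDs in the prompt (and in the final answer) is what makes the generated output auditable.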
→ how do we design RAG workflows for auditability and safety?
21. Key Takeaways #
- Choose models and vector stores based on data, latency, cost, and security needs
- Use ScaNN or HNSW for scalable ANN
- Use hybrid filtering to improve search accuracy
- RAG is critical for grounded LLMs
Closing insight: Embeddings + ANN + RAG form the foundation of trustworthy, scalable, semantic applications.