Day 2 – Embeddings & Vector Databases #


1. Why Embeddings? #

We begin with the core problem of representing diverse data types. Images, text, audio, and structured data all need to be compared, retrieved, and clustered. Embeddings map these into a shared vector space where similarity can be computed numerically.

→ how can we measure and preserve semantic meaning across different data types?

2. Mapping Data to Vector Space #

Embeddings reduce dimensionality while preserving meaning. Just as latitude and longitude embed Earth's surface into 2D coordinates, BERT embeds text into a 768-dimensional space, where distances reflect semantic similarity.
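
A minimal sketch of this idea, assuming the sentence-transformers library (the model name all-MiniLM-L6-v2 is an illustrative choice; it produces 384-dimensional vectors rather than BERT's 768):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

sentences = ["A cat sits on the mat.",
             "A kitten rests on a rug.",
             "Stock prices fell sharply."]
vecs = model.encode(sentences, normalize_embeddings=True)

# Nearby vectors = similar meaning.
print(util.cos_sim(vecs[0], vecs[1]))  # high: same idea, different words
print(util.cos_sim(vecs[0], vecs[2]))  # low: unrelated topics
```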

→ how do different embedding models impact representation fidelity and downstream performance?

3. Key Applications #

Embeddings power:

  • Search (e.g., RAG, internet-scale)
  • Recommendations
  • Fraud detection
  • Multimodal integration (e.g., text + image)

→ how do we design joint embeddings for multi-modal tasks?

4. Quality Metrics #

Evaluation focuses on how well embeddings retrieve similar items:

  • Precision@k: Are top results relevant?
  • Recall@k: Do we get all relevant items?
  • nDCG: Are the most relevant ranked highest?
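
All three metrics are simple to compute directly; a sketch (the document IDs are made up for illustration):

```python
import numpy as np

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items appearing in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """relevance: dict of item id -> graded relevance score."""
    dcg = sum(relevance.get(d, 0) / np.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2"]           # hypothetical ranking
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3))  # 0.33
print(recall_at_k(retrieved, relevant, 3))     # 0.5
```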

→ how can evaluation help us improve and select embedding models for specific applications?

5. Retrieval Pipeline #

A standard setup involves:

  • Embedding documents and queries via a dual encoder
  • Storing doc embeddings in a vector DB (e.g., Faiss)
  • At query time, embedding the question and retrieving nearest neighbors
  • Feeding results into an LLM for synthesis
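
A minimal end-to-end sketch of this setup, assuming sentence-transformers for the encoder and faiss-cpu for the index (a single shared encoder stands in for the two towers of a dual encoder; the LLM call itself is left as a placeholder):

```python
# pip install faiss-cpu sentence-transformers
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
docs = ["Faiss is a library for vector similarity search.",
        "BERT embeds text into 768D space.",
        "Paris is the capital of France."]

# Embed documents and store them; inner product on normalized vectors = cosine.
doc_vecs = encoder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(int(doc_vecs.shape[1]))
index.add(doc_vecs)

# At query time: embed the question, retrieve nearest neighbors.
question = "Which library does vector search?"
q_vec = encoder.encode([question], normalize_embeddings=True)
scores, ids = index.search(q_vec, 2)
context = [docs[i] for i in ids[0]]

# Feed retrieved context into an LLM prompt for synthesis (LLM call omitted).
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```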

→ how does the embedding model choice impact the quality of LLM-augmented answers?

6. Operational Considerations #

Embedding models keep improving (e.g., scores on the BEIR benchmark rising from 10.6 to 55.7 across model generations). Choose platforms that:

  • Abstract away model versioning
  • Enable easy re-evaluation
  • Provide upgrade paths (e.g., Vertex AI APIs)

→ how do we future-proof embedding systems in production?

7. Text Embedding Lifecycle #

From raw strings to embedded vectors:

  • Tokenization → Token IDs → Optional one-hot encoding → Dense embeddings
  • Traditional one-hot encoding lacks semantics; dense embeddings retain contextual meaning
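
The lifecycle above is easy to see in practice; a sketch assuming the Hugging Face transformers tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tok("Embeddings retain meaning.")["input_ids"]  # raw string -> token IDs
print(tok.convert_ids_to_tokens(ids))  # e.g., ['[CLS]', 'em', '##bed', ...]

# Each ID indexes a row of the model's embedding matrix (equivalent to
# multiplying a one-hot vector by that matrix), yielding a dense vector.
```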

→ how does token context influence the quality of embeddings?

8. Word Embeddings: GloVe, Word2Vec, SWIVEL #

Early methods:

  • Word2Vec (CBOW, Skip-Gram): Context windows define meaning
  • GloVe: Combines global + local word statistics using matrix factorization
  • SWIVEL: Fast training, handles rare terms, parallelizable
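
A quick Word2Vec sketch using gensim (the tiny corpus is illustrative; real training needs far more text):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]

# sg=1 selects Skip-Gram, sg=0 selects CBOW; window sets the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)              # (50,) static vector per word
print(model.wv.similarity("cat", "dog"))  # context-window-driven similarity
```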

→ are static embeddings enough, or do we need context-aware representations?

9. Shallow and Deep Models #

Two major paradigms:

  • BoW Models (TF-IDF, LSA, LDA): Sparse, easy to compute but lack context
  • Doc2Vec: Introduces a learned paragraph vector
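
The sparse-versus-dense contrast is visible with scikit-learn's TF-IDF: two documents score as similar only if they share exact terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["embeddings map text to vectors",
        "vector databases index embeddings",
        "the weather is sunny today"]

X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
print(cosine_similarity(X[0], X[1]))       # nonzero: shared term "embeddings"
print(cosine_similarity(X[0], X[2]))       # zero: no term overlap, no semantics
```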

→ how do we encode long-range relationships and context in documents?

10. BERT and Beyond #

BERT revolutionized document embedding with:

  • Deep bi-directional transformers
  • Pretraining on masked tokens
  • Next-sentence prediction

It powers models like Sentence-BERT, SimCSE, E5, and now Gemini-based embeddings.
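
Sentence-BERT and its successors add pooling and fine-tuning on top; the basic step of turning BERT token outputs into one sentence vector can be sketched with mean pooling (a simplification of what Sentence-BERT does, assuming Hugging Face transformers and PyTorch):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["Deep bidirectional context.", "Another sentence."],
            padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, tokens, 768)

# Mean-pool token vectors, ignoring padding, to get one 768-d sentence vector.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_vecs = (hidden * mask).sum(1) / mask.sum(1)
print(sentence_vecs.shape)                     # torch.Size([2, 768])
```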

→ what’s the trade-off between compute cost and performance in deep embeddings?

11. Images and Multimodal Representations #

  • Image embeddings from CNNs or ViTs (e.g., EfficientNet)
  • Multimodal models (e.g., ColPali) map text + image into a shared space
  • Enables querying images via text without OCR
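
ColPali needs specialized tooling, but the shared text-image space idea can be sketched with CLIP, a widely available multimodal model (the image path is hypothetical):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("invoice.png")  # hypothetical local image
texts = ["a scanned invoice", "a photo of a dog"]

# Text and image are embedded into the same space and compared directly,
# with no OCR involved.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)                       # match probability per caption
```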

→ what are the infrastructure needs to support scalable multimodal embedding workflows?

12. Embeddings for Structured Data #

  • Use dimensionality reduction (e.g., PCA) or learned embeddings
  • Enable anomaly detection or classification with fewer labeled examples
  • Especially useful when labeled data is scarce
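
A sketch with scikit-learn's PCA on stand-in tabular data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(1000, 50)  # stand-in for 1,000 rows of 50 numeric features

pca = PCA(n_components=8)
Z = pca.fit_transform(X)       # compressed 8-d embeddings per row

# Z can feed anomaly detectors or classifiers that need few labeled examples.
print(Z.shape, pca.explained_variance_ratio_.sum())
```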

→ how can we compress structured data while retaining signal?

13. User-Item & Graph Embeddings #

  • Embed users and items into the same space for recommender systems
  • Graph embeddings (e.g., Node2Vec, DeepWalk) capture node relationships
  • Useful for classification, clustering, and link prediction
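
The core DeepWalk idea fits in a few lines: run random walks over the graph and feed them to Word2Vec as if they were sentences (the toy graph is made up):

```python
import random
from gensim.models import Word2Vec

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def random_walk(start, length=5):
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Walks play the role of sentences: co-visited nodes get nearby vectors.
walks = [random_walk(node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, epochs=20)
print(model.wv.similarity("a", "b"))
```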

→ how do we preserve both entity and relational meaning in embeddings?

14. Dual Encoder & Contrastive Loss #

  • Most embeddings today use dual encoders (e.g., query/doc or text/image towers)
  • Trained with contrastive loss to pull positives close, push negatives away
  • Often initialized from large foundation models (e.g., BERT, Gemini)
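
The standard in-batch-negatives contrastive loss is compact in PyTorch; a sketch with random vectors standing in for the two towers' outputs:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Each query's paired doc is the positive; every other doc in the
    batch serves as a negative (in-batch negatives)."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    logits = q @ d.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(len(q))   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss)
```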

→ how do we balance generalization vs. task-specific fine-tuning?

15. Vector Search #

  • Keyword search fails for synonyms and semantic variants
  • Vector search embeds documents and queries, enabling “meaning-based” retrieval
  • Similarity measured via cosine similarity, dot product, or Euclidean distance
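
The three measures in NumPy; note that on unit-normalized vectors they all induce the same ranking:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.9])

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

# On unit vectors: cosine == dot, and euclidean^2 = 2 - 2*cosine,
# so all three order neighbors identically.
print(dot, cosine, euclidean)
```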

→ what metric and database architecture optimize for your use case?

16. Efficient Nearest Neighbor Techniques #

  • Brute-force search is O(N) per query, which is not viable at scale
  • LSH hashes similar vectors into the same bucket
  • Tree-based methods (KD-tree, Ball-tree) work for low dimensions
  • HNSW and ScaNN handle large-scale, high-dimensional spaces efficiently
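
A small HNSW example, assuming the hnswlib package:

```python
# pip install hnswlib
import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.randn(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build-time quality
index.add_items(data, np.arange(n))

index.set_ef(50)  # query-time knob: higher ef = better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)
print(labels)     # approximate, not exact, nearest neighbors
```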

→ how can we trade off speed vs. accuracy using ANN techniques?

17. What Are Vector Databases? #

  • Built specifically to index and search embeddings
  • Combine ANN search (e.g., ScaNN, HNSW) with metadata filtering
  • Support hybrid search (semantic + keyword) with pre- and post-filtering
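
Pre- versus post-filtering is easiest to see in a toy brute-force version (real vector databases push the filter into the ANN index itself):

```python
import numpy as np

def search(q_vec, doc_vecs, metadata, lang, k=3, pre_filter=True):
    """Toy hybrid search: vector similarity plus a metadata filter."""
    if pre_filter:
        # Pre-filter: restrict the candidate set before the similarity search.
        keep = [i for i, m in enumerate(metadata) if m["lang"] == lang]
        sims = doc_vecs[keep] @ q_vec
        return [keep[i] for i in np.argsort(-sims)[:k]]
    # Post-filter: search everything, then drop non-matching results
    # (risks returning fewer than k items).
    ranked = np.argsort(-(doc_vecs @ q_vec))
    return [i for i in ranked if metadata[i]["lang"] == lang][:k]
```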

→ how do we ensure low-latency, high-recall vector search at scale?

18. Operational Considerations #

  • Embeddings evolve over time; re-embedding a corpus after a model upgrade can be costly
  • Combine vector + keyword search for literal queries (e.g., IDs)
  • Choose vector DBs based on workload (e.g., AlloyDB for OLTP, BigQuery for OLAP)

→ how do we manage model/version drift and storage efficiency?

19. Core Use Cases #

Embeddings power:

  • Search & retrieval
  • Semantic similarity & deduplication
  • Recommendations
  • Clustering & anomaly detection
  • Few-shot classification
  • Retrieval Augmented Generation (RAG)

→ how do embeddings improve relevance and trust in LLM outputs?

20. Retrieval-Augmented Generation (RAG) #

  • RAG improves factual grounding and reduces hallucinations
  • Retrieves documents → augments prompt → generates answer
  • Return sources for transparency and to enable human or LLM coherence checks
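
Putting the three steps together, reusing the encoder/index sketch from section 5 (the llm argument is a placeholder for any generation call):

```python
def answer_with_rag(question, encoder, index, docs, llm, k=3):
    # 1. Retrieve: embed the question, fetch the nearest documents.
    q_vec = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(q_vec, k)
    sources = [docs[i] for i in ids[0]]

    # 2. Augment: ground the prompt in the retrieved text.
    prompt = ("Answer using only the context below. Cite source numbers.\n"
              + "\n".join(f"[{i}] {s}" for i, s in enumerate(sources))
              + f"\n\nQuestion: {question}")

    # 3. Generate, returning sources for transparency and coherence checks.
    return llm(prompt), sources
```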

→ how do we design RAG workflows for auditability and safety?

21. Key Takeaways #

  • Choose models and vector stores based on data, latency, cost, and security needs
  • Use ScaNN or HNSW for scalable ANN
  • Use hybrid filtering to improve search accuracy
  • RAG is critical for grounded LLMs

Closing insight: Embeddings + ANN + RAG form the foundation of trustworthy, scalable, semantic applications.