AI Engineer & Builder

CHINMAY HEBBAL.

Building AI that cuts through the hype.

Projects
01 / 04 MCP

GMAIL MCP SERVER

Model Context Protocol · Gmail as AI-native tooling

Built an MCP server that exposes Gmail as a set of AI-native tools. Any MCP-compatible agent including Anthropic's Claude can read, search, filter and send emails through standardised tool calls over IMAP/SMTP, without any browser automation. Ships with a FastAPI-backed web UI for direct inbox management and a full Docker setup for one-command deployment.

  • 8 MCP tools, including: list, search by sender/subject, get unread, get details, send, and list folders
  • Two deployment modes: browser-based web UI (port 8000) or headless MCP server
  • App Password auth: no OAuth flow, no browser dependency
  • Fully Dockerised with compose file for both modes
8 MCP Tools
2 Deploy Modes
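The tool-call pattern above can be sketched as a registry of named handlers that an agent dispatches to by tool name. The tool names, message shape, and in-memory inbox below are illustrative stand-ins, not the server's actual API; the real server backs these calls with IMAP/SMTP.

```python
# Hypothetical sketch of exposing mailbox actions as MCP-style named tools.
# Names and data shapes are illustrative, not the project's real interface.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Message:
    sender: str
    subject: str
    unread: bool

# In-memory stand-in for an IMAP mailbox.
INBOX = [
    Message("alice@example.com", "Q3 report", unread=True),
    Message("bob@example.com", "Lunch?", unread=False),
]

TOOLS: dict[str, Callable[..., list[Message]]] = {}

def tool(name: str):
    """Register a handler under an MCP-style tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("search_by_sender")
def search_by_sender(sender: str) -> list[Message]:
    return [m for m in INBOX if m.sender == sender]

@tool("get_unread")
def get_unread() -> list[Message]:
    return [m for m in INBOX if m.unread]

# An MCP-compatible agent dispatches a standardised tool call by name:
result = TOOLS["get_unread"]()
```

The point of the registry is that any agent speaking the protocol can discover and invoke tools uniformly, which is what removes the need for browser automation.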
02 / 04 A2A

MULTI-AGENT COURSE GEN

Agent-to-Agent Protocol · Google ADK + Gemini 2.5 Pro

A multi-agent pipeline where four specialised agents coordinate over Google's A2A protocol to generate structured courses on any topic. The Orchestrator routes tasks to a Researcher (Google Search), a Judge (pass/fail quality gate), and a Content Builder — iterating up to three times until the research meets the bar, then streaming the final course directly to the browser via SSE.

  • 4 agents (Researcher, Judge, Content Builder, Orchestrator), each a separate microservice
  • Quality loop: Researcher retries up to 3× based on Judge feedback before handoff
  • Real-time SSE streaming from Orchestrator to frontend as course is written
  • All agents powered by Gemini 2.5 Pro with Google Search grounding
4 Agents
Quality Loop
5 Microservices
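The quality loop described above can be sketched as follows. In the real pipeline the agents are separate A2A microservices; here they are plain callables for clarity, and the toy researcher/judge pair is purely illustrative.

```python
# Illustrative sketch of the Researcher/Judge quality gate: the Researcher
# retries up to 3x, each time incorporating the Judge's feedback.
from typing import Callable

MAX_ATTEMPTS = 3  # retry ceiling before handoff to the Content Builder

def research_with_quality_gate(
    topic: str,
    researcher: Callable[[str, str], str],    # (topic, feedback) -> notes
    judge: Callable[[str], tuple[bool, str]], # notes -> (passed, feedback)
) -> str:
    feedback = ""
    notes = ""
    for _ in range(MAX_ATTEMPTS):
        notes = researcher(topic, feedback)
        passed, feedback = judge(notes)
        if passed:
            break
    # Hand off best-effort notes even if the gate never passed.
    return notes

# Toy agents: the judge passes once the notes mention "attention".
def toy_researcher(topic: str, feedback: str) -> str:
    return f"notes on {topic}" + (" covering attention" if feedback else "")

def toy_judge(notes: str) -> tuple[bool, str]:
    return ("attention" in notes, "please cover attention")

result = research_with_quality_gate("transformers", toy_researcher, toy_judge)
```

Bounding the loop at three attempts keeps cost predictable while still letting the Judge's pass/fail feedback steer the Researcher.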
03 / 04 SLM

SLM BENCHMARK SUITE

Small Language Model Evaluation on Consumer Hardware

A rigorous benchmarking framework for small language models running locally via Ollama, targeting a 4 GB VRAM machine (RTX 3050 Laptop). Measures time-to-first-token, throughput (tokens/sec), and quality across 6 prompt categories using 5 distinct scoring strategies. Results surface in a live Streamlit dashboard with radar charts, heatmaps, and side-by-side model comparison. Qwen2.5:3B ranks #1 on quality; Gemma2:2B leads on speed.

  • 3 models (Qwen2.5:3B, Gemma2:2B, Llama3.2), all running on 4 GB VRAM
  • 20 prompts × 6 categories: reasoning, coding, factual, maths, summarisation, instruction-following
  • 5 scorers: keyword overlap, exact match, code fence, structural (word/line count), ROUGE-1
  • Dashboard: radar chart, quality heatmap, TTFT box plot, history browser
3 Models Tested
20 Benchmark Prompts
4GB VRAM Target
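Of the five scoring strategies, ROUGE-1 is the easiest to illustrate: it is unigram overlap between a model's answer and a reference, combined into an F1 score. This is a simplified sketch, not the suite's exact implementation.

```python
# Minimal ROUGE-1 F1: clipped unigram overlap between candidate and reference.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each match clipped to reference count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
# 5 of 6 unigrams overlap in both directions, so F1 = 5/6
```

Keyword overlap and exact match work similarly on whole tokens; the structural scorer instead checks word/line counts, which is why scoring is category-aware.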
04 / 04 Inference

vLLM vs SGLang vs TRT-LLM

LLM Serving Benchmark · Qwen2.5-7B on RTX 5090

End-to-end benchmark of the three leading LLM serving backends across practical, overload, and extreme concurrency levels. Measures RPS, TPS, TTFT, and inter-token latency alongside quality evaluation (MMLU, GSM8K, HumanEval) with a Streamlit dashboard and Parquet export. TRT-LLM delivers 41% more TPS than vLLM at saturation; vLLM and SGLang reclaim the lead when the queue never empties.

  • 3 backends: vLLM (FlashInfer), SGLang, TRT-LLM (compiled TRT engine, CUDA 12.9)
  • 4 test suites: baseline sweep, fine-grained saturation, overload, and extreme overload up to c=5,000
  • Quality eval: MMLU, GSM8K, HumanEval, with a direct-API fallback when GuideLLM is absent
  • Streamlit dashboard: KPI cards, TPS/TTFT curves, tradeoff scatter, raw data export
41% More TPS vs vLLM
22 ms TTFT @ c=1
5000 Max Concurrency
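The core serving metrics (TTFT, TPS, inter-token latency) all derive from per-request token timestamps. A sketch of those derivations, with illustrative field names rather than the benchmark's actual schema:

```python
# Deriving TTFT, throughput, and inter-token latency from one request's
# token arrival times. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class RequestTrace:
    start: float              # request sent (seconds)
    token_times: list[float]  # arrival time of each generated token

def ttft(trace: RequestTrace) -> float:
    """Time to first token: first arrival minus send time."""
    return trace.token_times[0] - trace.start

def tokens_per_second(trace: RequestTrace) -> float:
    """Throughput over the full generation window."""
    duration = trace.token_times[-1] - trace.start
    return len(trace.token_times) / duration

def mean_inter_token_latency(trace: RequestTrace) -> float:
    """Average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)

trace = RequestTrace(start=0.0, token_times=[0.022, 0.032, 0.042, 0.052])
```

Under load, TTFT is dominated by queueing while inter-token latency reflects batch scheduling, which is why the two metrics can rank backends differently at saturation versus overload.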
Work Experience
Jun 2024 – Present
Vulcan Materials Company
AI Data Engineer
Apr 2021 – Aug 2023
Pocket FM
Data Engineer - 2
Oct 2018 – Feb 2021
Recosense Infosolutions
Software Engineer, Platform
Education
Aug 2023 – May 2025
The University of Texas at Dallas
Master of Science, Information Systems
Tools & Technologies
AI Protocols & Frameworks
MCP Protocol (Model Context Protocol)
A2A Protocol (Agent-to-Agent)
Google ADK (Agent Development Kit)
Gemini 2.5 Pro
Anthropic Claude API
Ollama (local LLM runtime)
Backend & APIs
Python 3.12+
FastAPI + uvicorn
Server-Sent Events (SSE)
REST API design
IMAP / SMTP
Async Python (asyncio)
Infrastructure & Tooling
Docker & Docker Compose
uv (Python package manager)
Streamlit dashboards
Git / GitHub
Linux / Bash scripting
Evaluation & Research
LLM Benchmarking methodology
ROUGE-1 scoring
TTFT & throughput measurement
Category-aware quality scoring
Small language model research
LLM Serving & Inference
vLLM (continuous batching, PagedAttention)
SGLang (RadixAttention, KV-cache sharing)
TensorRT-LLM (compiled CUDA kernels, graph capture)
CUDA 12.9 & Blackwell GPU (SM 12.0)
MMLU / GSM8K / HumanEval evaluation
GuideLLM & OpenAI-compatible APIs
Certifications
Hugging Face
The LLM Course
Fundamentals of AI Agents
Astronomer
Apache Airflow