3 posts tagged with "machine learning"

Build Recursive Language Models on AWS in Minutes with Strands Agents and Amazon Bedrock AgentCore

· 12 min read
Manu Mishra
Solutions Architect & Applied Software Engineer

[Figure: RLM on AWS Architecture]

Introduction

Modern large language models face a fundamental limitation: context windows. While frontier models now reach 1 million tokens (Nova Premier, Claude Sonnet 4.5), workloads analyzing entire codebases, document collections, or multi-hour conversations can easily exceed 10 million tokens—far beyond any single model's capacity.

This post demonstrates Recursive Language Models (RLMs), an inference strategy from MIT CSAIL research that enables scaling to inputs far beyond context windows. What makes this implementation special: Strands Agents and Amazon Bedrock AgentCore reduce what could be weeks of glue code and deployment work to just a few hours of development.
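
The full post builds this with Strands Agents on AgentCore; as a rough, library-agnostic sketch of the recursive strategy itself, here is what the core loop might look like using boto3's Bedrock Converse API. The model ID and the character-based chunk budget are placeholder assumptions, not the post's exact setup.

```python
# Minimal sketch of the recursive idea: split an oversized input into
# chunks, extract what's relevant from each with a sub-call, then recurse
# on the combined notes until the text fits a single context window.
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-sonnet-4-5-20250929-v1:0"  # placeholder model ID
CHUNK_CHARS = 200_000  # crude stand-in for a real token budget

def ask(prompt: str) -> str:
    """Single Bedrock Converse call."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def recursive_query(question: str, text: str) -> str:
    """Recurse until the working text fits in one call."""
    if len(text) <= CHUNK_CHARS:
        return ask(f"{question}\n\n---\n\n{text}")
    chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
    notes = [ask(f"Extract everything relevant to: {question}\n\n{c}") for c in chunks]
    return recursive_query(question, "\n\n".join(notes))
```

Each level of recursion compresses the input, so a 10M-token corpus can be reduced to a context-sized digest in a few passes; the agent framework's job is to make that decomposition adaptive rather than fixed-size.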

Google's EmbeddingGemma on AWS Lambda - A Curiosity-Driven Experiment

· 6 min read
Manu Mishra
Solutions Architect & Applied Software Engineer

[Figure: EmbeddingGemma on AWS Lambda]

Note: This is a curiosity-driven experiment, not a production recommendation. For real workloads, Amazon SageMaker is the right choice. This project explores what's possible when you push serverless boundaries.

1. The idea

After my BitNet Lambda experiment, I kept thinking: what about embeddings? I had text generation working on Lambda, but what about the other half of modern AI applications?

Google's EmbeddingGemma caught my attention—300M parameters, multilingual, designed for efficiency. Could it work on Lambda? Only one way to find out.

So I fired up Amazon Q Developer and started experimenting.
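
For a sense of the shape of the experiment, here is a minimal sketch of a Lambda handler serving embeddings, assuming a container image with sentence-transformers installed and the model weights available. The Hugging Face model ID and cache path are assumptions, not necessarily the post's exact setup.

```python
import json
import os

# Redirect the Hugging Face cache to /tmp, Lambda's only writable path,
# before importing the library that reads it.
os.environ.setdefault("HF_HOME", "/tmp/hf")

from sentence_transformers import SentenceTransformer

# Loaded at module init so warm invocations reuse the model.
model = SentenceTransformer("google/embeddinggemma-300m")  # assumed HF model ID

def handler(event, context):
    texts = json.loads(event["body"])["texts"]
    vectors = model.encode(texts, normalize_embeddings=True)
    return {
        "statusCode": 200,
        "body": json.dumps({"embeddings": [v.tolist() for v in vectors]}),
    }
```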

Running 1.58-bit LLMs on AWS Lambda - When Serverless Meets Extreme Quantization

· 6 min read
Manu Mishra
Solutions Architect & Applied Software Engineer

[Figure: BitNet on AWS Lambda]

What you'll learn (tl;dr): In ~12 minutes you'll see how to deploy Microsoft's BitNet 1.58-bit quantized LLM on AWS Lambda, how the container-based architecture works, and how the microsoft/bitnet-b1.58-2B-4T model performs across different memory configurations.

Big idea: 1.58-bit quantization enables LLM deployment on Lambda's CPU infrastructure. At ~1.1GB, the model fits within Lambda's constraints for serverless AI inference.
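
To make the container approach concrete, here is a hedged sketch of what a handler in such an image might look like, assuming the image bakes in a llama.cpp-style CLI built from bitnet.cpp along with the GGUF weights. The binary path, model filename, and flags are assumptions, not the post's exact configuration.

```python
# Minimal sketch of a Lambda handler wrapping a bitnet.cpp build.
# Assumes the container image includes a llama.cpp-style `llama-cli`
# binary and the 1.58-bit GGUF model; paths and flags are assumptions.
import json
import subprocess

BINARY = "/opt/bitnet/llama-cli"            # assumed location in the image
MODEL = "/opt/bitnet/ggml-model-i2_s.gguf"  # ~1.1GB 1.58-bit model file

def handler(event, context):
    prompt = json.loads(event["body"])["prompt"]
    result = subprocess.run(
        [BINARY, "-m", MODEL, "-p", prompt, "-n", "128", "--temp", "0"],
        capture_output=True, text=True, timeout=120,
    )
    return {"statusCode": 200, "body": json.dumps({"output": result.stdout})}
```

Because inference runs entirely on CPU, Lambda's memory setting doubles as its CPU dial, which is why benchmarking across memory configurations matters here.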