<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Principle AI Inc Engineering Blog]]></title><description><![CDATA[Principle AI is a startup focused on helping businesses build and deploy safe AI models.]]></description><link>https://blog.principle-ai.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 02:19:08 GMT</lastBuildDate><atom:link href="https://blog.principle-ai.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[What is Model Context Protocol (MCP)?]]></title><description><![CDATA[https://www.youtube.com/watch?v=sShDyqX7Y-0
 
If you have more questions about AI and want AI mentorship please reach out to us on principle-ai.com
Model Context Protocol
Anthropic has released this protocol that allows GenAI models to use external t...]]></description><link>https://blog.principle-ai.com/what-is-model-context-protocol-mcp</link><guid isPermaLink="true">https://blog.principle-ai.com/what-is-model-context-protocol-mcp</guid><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Thu, 10 Jul 2025 01:38:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752111422517/bc12b662-8450-43ce-8f7c-6bf8f5459b57.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=sShDyqX7Y-0">https://www.youtube.com/watch?v=sShDyqX7Y-0</a></div>
<p> </p>
<p>If you have more questions about AI and want AI mentorship, please reach out to us at principle-ai.com.</p>
<h2 id="heading-model-context-protocol">Model Context Protocol</h2>
<p>Anthropic has released the Model Context Protocol (MCP), an open standard that allows GenAI models to call external tools and incorporate those tools' results into their responses.</p>
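<p>Under the hood, MCP runs over JSON-RPC 2.0: the model's client sends a request asking an MCP server to execute one of the tools it advertises. As a rough sketch (the tool name and arguments below are hypothetical, not part of the specification), a tool invocation looks roughly like this:</p>

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_weather",
    "arguments": { "city": "Paris" }
  }
}
```

<p>The server runs the tool and returns a result message, which the model then folds into its reply.</p>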
]]></content:encoded></item><item><title><![CDATA[What is agentic AI ?]]></title><description><![CDATA[https://www.youtube.com/watch?v=68ANA-TbRik
 
Introduction
Agentic AI is AI with agency. Which means instead of simply asking a specific problem to be solved, an AI system is given a broad goal and asked to solve it. For example “organize a birthday ...]]></description><link>https://blog.principle-ai.com/what-is-agentic-ai</link><guid isPermaLink="true">https://blog.principle-ai.com/what-is-agentic-ai</guid><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Sat, 25 Jan 2025 20:28:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737835928467/8098cea9-1772-4aeb-9d81-c242babcdbd4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=68ANA-TbRik">https://www.youtube.com/watch?v=68ANA-TbRik</a></div>
<p> </p>
<h2 id="heading-introduction">Introduction</h2>
<p><a target="_blank" href="https://aiauthority.dev/rise-of-agentic-ai">Agentic AI is AI with agency.</a> This means that instead of being asked to solve a specific, narrow problem, an AI system is given a broad goal, for example “organize a birthday party”, and is expected to use its own agency to get things done. As AI has improved, agentic AI has moved from a theoretical possibility to, within some constraints, a practical reality.</p>
<p>In this post we will give you a more detailed overview of what agentic AI is and what to expect in the coming months.</p>
<h2 id="heading-what-makes-an-ai-agentic">What makes an AI agentic?</h2>
<p>For an AI to be agentic, it needs to have the following characteristics.</p>
<ul>
<li><p>Autonomy</p>
</li>
<li><p>Goal-oriented</p>
</li>
<li><p>Adaptability</p>
</li>
<li><p>Decision-making</p>
</li>
</ul>
<p>An agentic system should be autonomous in the sense that it works without a human constantly making decisions for it. Asking for occasional human permission is fine, but otherwise the agent should operate entirely on its own.</p>
<p>An agentic system should always work towards a specific goal rather than exist in a vacuum. For example, one goal could be to “buy these sneakers when they are the cheapest”.</p>
<p>Adaptability refers to a human-like ability to react to unexpected changes in the environment. The agent should not simply fail; it should look for alternatives.</p>
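<p>These characteristics can be sketched as a simple decision loop. The class and scenario below are made up for this post (not a real agent framework): the agent autonomously pursues a goal, makes a decision about each option, and adapts by moving on to an alternative when an option fails.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class SneakerAgent {
    // Goal-oriented: buy the sneakers when the price is within budget.
    private final double budget;

    public SneakerAgent(double budget) {
        this.budget = budget;
    }

    /** Returns the price paid, or -1 if no store could satisfy the goal. */
    public double pursueGoal(List<Double> storePrices) {
        Deque<Double> options = new ArrayDeque<>(storePrices);
        while (!options.isEmpty()) {        // autonomy: runs without human input
            double price = options.poll();  // decision-making: evaluate the next option
            if (price <= budget) {
                return price;               // goal achieved
            }
            // adaptability: this store is too expensive, so try an alternative
        }
        return -1;                          // goal not achievable under the constraint
    }
}
```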
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737836883333/9b248450-d80a-4797-837c-4a7a1171aa2c.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-self-driving-cars-is-a-great-example-of-agentic-system">Self-driving cars are a great example of an agentic system</h2>
<p>Alphabet’s fully self-driving Waymo cars are a great example of a complex but very real agentic AI system. They can drive from point A to point B reliably without any input from a human and adapt to changing road conditions.</p>
<h2 id="heading-dangers-of-agentic-ai">Dangers of Agentic AI</h2>
<p>There are dangers associated with autonomous AI systems. Like all AI systems, they can make a mistake, fail to recognize it, and double down on it. For example, an agent asked to book a good flight deal might book the same flight multiple times, wasting a lot of money. An agent might also be tricked by bad actors into giving away credit card numbers. This is why agentic AI needs rigorous testing and strong constraints to guard against such failures.</p>
<h2 id="heading-some-upcoming-ai-agents">Some upcoming AI Agents</h2>
<p>Salesforce and others are building virtual employees that can take over a lot of work from humans, from sales and customer service to after-sales support. A lot of this work is open ended and yet somewhat constrained.</p>
<p>Many banks are looking at replacing their customer support and branch staff with AI agents. Bloomberg reported that the <a target="_blank" href="https://www.fintech-core.com/generative-ai-is-likely-going-to-cause-job-loss-in-traditional-banking-industry">banking sector could see 200K jobs replaced by AI agents</a> fairly soon.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737836893436/7429419b-bfa7-4ab0-9fcd-e7964bfaaa37.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Design Instagram Feed using Generative AI]]></title><description><![CDATA[At PrincipleAI, we stay one step ahead on all things AI, so you don’t have to worry. We bring you handpicked AI-related news and technical deep dives. Our most popular series is about system design. In this post, we are going to design an Instagram-l...]]></description><link>https://blog.principle-ai.com/design-instagram-feed-using-generative-ai</link><guid isPermaLink="true">https://blog.principle-ai.com/design-instagram-feed-using-generative-ai</guid><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Fri, 24 Jan 2025 17:42:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737739848275/fbe5aa44-ce2f-4b22-855a-3ae81d35a4ff.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At PrincipleAI, we stay one step ahead on all things AI, so you don’t have to worry. We bring you handpicked AI-related news and technical deep dives. Our most popular series is about system design. In this post, we are going to design an <a target="_blank" href="https://aiauthority.fastx.dev/ai-recommendation-system">Instagram-like feed using generative AI</a>.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Real-world AI deployments are complex. We have made a video teaching the basics of such deployments <a target="_blank" href="https://www.youtube.com/watch?v=NUlLDY16QZU">here</a>. These days, interviewers also ask questions to get a sense of your knowledge of generative AI.</p>
<p>We need to design an Instagram-like feed using generative AI. In this task, a user should see posts ranked in a way that encourages more engagement. The user follows their friends on the platform, so we need to include all relevant posts from them. We also need to mix in trending posts from across the platform that align with the user's personal interests. We determine these interests based on the user's actions within the app.</p>
<h2 id="heading-solution">Solution</h2>
<p>The solution has two parts. First, storage: we keep the user's candidate posts, along with trending posts, in a vector database. When it is time to generate the feed, we query the vector database to get a list of posts.</p>
<p>The second part of the solution is ranking. This is done by the model. Since we use a large language model, we treat the model's logic as a black box. We can use multi-shot prompting to explain to the LLM the types of posts the user has interacted with and ask the model to rank the given set of posts based on that information.</p>
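<p>As a minimal sketch of the multi-shot ranking prompt (the class and method names here are hypothetical, and a real system would also include the model call and response parsing), the prompt can be assembled from the user's past interactions and the candidate posts:</p>

```java
import java.util.List;

public class FeedPromptBuilder {
    /** Assembles a multi-shot ranking prompt for the LLM. */
    public static String build(List<String> pastEngagements, List<String> candidates) {
        StringBuilder sb = new StringBuilder();
        sb.append("You rank social media posts for a single user.\n");
        sb.append("Posts this user engaged with in the past:\n");
        for (String example : pastEngagements) {
            sb.append("- ").append(example).append('\n');   // the "shots"
        }
        sb.append("Rank these candidate posts from most to least engaging:\n");
        for (int i = 0; i < candidates.size(); i++) {
            sb.append(i + 1).append(". ").append(candidates.get(i)).append('\n');
        }
        sb.append("Answer with the candidate numbers in ranked order only.");
        return sb.toString();
    }
}
```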
<h2 id="heading-video">Video</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=NQeJIpBdpQY">https://www.youtube.com/watch?v=NQeJIpBdpQY</a></div>
]]></content:encoded></item><item><title><![CDATA[Preparing for System Design : Starting point]]></title><description><![CDATA[We have prepared an elaborate list of terms you need to learn as you start your system design journey. We are not going to explain these terms to you but simply list them out. We might cover them in detail in future posts. So subscribe to the blog if...]]></description><link>https://blog.principle-ai.com/preparing-for-system-design-starting-point</link><guid isPermaLink="true">https://blog.principle-ai.com/preparing-for-system-design-starting-point</guid><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Tue, 07 Jan 2025 04:15:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1736223304915/76a1dffe-ae71-4cd8-b3c9-80f3b4d745ad.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We have prepared an elaborate list of terms you need to learn as you start your system design journey. We are not going to explain these terms to you but simply list them out. We might cover them in detail in future posts. So subscribe to the blog if you want them.</p>
<p>If you do not know what some of these terms mean, you are not yet prepared, so spend time learning them. Use this list as a checklist.</p>
<p><strong>1. Fundamental Concepts</strong></p>
<ul>
<li><p><strong>Scalability:</strong> Ability of a system to handle growing amounts of data and traffic.</p>
</li>
<li><p><strong>Reliability:</strong> Ability of a system to consistently perform its intended function without failure.</p>
</li>
<li><p><strong>Availability:</strong> The percentage of time a system is operational and accessible.</p>
</li>
<li><p><strong>Consistency:</strong> Ensuring all users see the same data at the same time.</p>
</li>
<li><p><strong>Efficiency:</strong> Optimal use of resources (CPU, memory, bandwidth) to achieve performance.</p>
</li>
<li><p><strong>Latency:</strong> Time taken for a request to be processed and a response received.</p>
</li>
<li><p><strong>Throughput:</strong> Number of requests a system can handle per unit of time.</p>
</li>
<li><p><strong>CAP Theorem:</strong> States that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance; when a network partition occurs, you must choose between consistency and availability.</p>
</li>
<li><p><strong>ACID Properties:</strong> (Atomicity, Consistency, Isolation, Durability) A set of properties that guarantee reliable database transactions.</p>
</li>
</ul>
<p><strong>2. Architectural Patterns</strong></p>
<ul>
<li><p><strong>Client-Server:</strong> A model where clients request services from a central server.</p>
</li>
<li><p><strong>Microservices:</strong> Breaking down an application into small, independent services.</p>
</li>
<li><p><strong>Message Queues:</strong> Components that allow asynchronous communication between services.</p>
</li>
<li><p><strong>Load Balancing:</strong> Distributing traffic across multiple servers to prevent overload.</p>
</li>
<li><p><strong>Caching:</strong> Storing frequently accessed data in a fast-access location.</p>
</li>
<li><p><strong>Databases (SQL, NoSQL):</strong> Organized collections of data for storage and retrieval.</p>
<ul>
<li><p><strong>Relational Databases (SQL):</strong> Structured data organized in tables with relationships.</p>
</li>
<li><p><strong>NoSQL Databases:</strong> Flexible schema for unstructured or semi-structured data.</p>
</li>
</ul>
</li>
<li><p><strong>CDN (Content Delivery Network):</strong> A geographically distributed network of servers that store copies of website assets to deliver content faster to users.</p>
</li>
<li><p><strong>Saga Pattern:</strong> A pattern for managing distributed transactions as a sequence of local transactions, each with a compensating action to undo it on failure.</p>
</li>
</ul>
<p><strong>3. Specific Technologies &amp; Concepts</strong></p>
<ul>
<li><p><strong>REST APIs:</strong> A standard for building web services that use HTTP methods (GET, POST, PUT, DELETE).</p>
</li>
<li><p><strong>HTTP:</strong> (Hypertext Transfer Protocol) The foundation of data communication on the web.</p>
</li>
<li><p><strong>DNS (Domain Name System):</strong> Translates domain names into IP addresses.</p>
</li>
<li><p><strong>TCP/IP:</strong> The suite of protocols that govern the internet.</p>
</li>
<li><p><strong>WebSockets:</strong> Enables real-time, two-way communication between client and server.</p>
</li>
<li><p><strong>Sharding:</strong> Horizontal partitioning of a database to distribute data across multiple machines.</p>
</li>
</ul>
<p><strong>4. Design Considerations</strong></p>
<ul>
<li><p><strong>Vertical Scaling:</strong> Increasing the resources of a single server (CPU, RAM).</p>
</li>
<li><p><strong>Horizontal Scaling:</strong> Adding more servers to a system to handle increased load.</p>
</li>
<li><p><strong>Database Indexing:</strong> Creating data structures to speed up data retrieval.</p>
</li>
<li><p><strong>Data Replication:</strong> Creating copies of data to improve availability and fault tolerance.</p>
</li>
<li><p><strong>Rate Limiting:</strong> Controlling the rate of traffic to a system to prevent abuse and overload.</p>
</li>
<li><p><strong>Circuit Breaker:</strong> A pattern to prevent cascading failures in a distributed system.</p>
</li>
</ul>
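<p>To make one of these terms concrete, rate limiting is often implemented with a token bucket. The sketch below is illustrative only (the class name and parameters are made up, not from a real library):</p>

```java
// Minimal token-bucket rate limiter sketch; not production code.
public class TokenBucket {
    private final long capacity;          // maximum burst size
    private final long refillPerSecond;   // sustained request rate
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if the request is allowed, false if it should be throttled. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Refill tokens for the time elapsed, capped at the bucket capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```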
<h2 id="heading-databases">Databases</h2>
<ul>
<li><p><strong>MySQL:</strong> Open-source, widely used, good for general purpose web applications.</p>
</li>
<li><p><strong>PostgreSQL:</strong> Advanced features, strong SQL compliance, good for complex data models.</p>
</li>
<li><p><strong>Oracle:</strong> Enterprise-grade, highly scalable, expensive but robust.</p>
</li>
<li><p><strong>MS SQL Server:</strong> Microsoft ecosystem, good integration with .NET, strong for transactional workloads.</p>
</li>
<li><p><strong>MongoDB:</strong> Flexible schema, good for document-oriented data, popular for web and mobile apps.</p>
</li>
<li><p><strong>Cassandra:</strong> Highly available, fault-tolerant, good for distributed systems and high write throughput.</p>
</li>
<li><p><strong>Redis:</strong> In-memory, very fast, excellent for caching and real-time data.</p>
</li>
<li><p><strong>Amazon DynamoDB:</strong> Fully managed, scalable, pay-as-you-go, integrated with the AWS ecosystem.</p>
</li>
<li><p><strong>Neo4j:</strong> Handles relationships efficiently, good for social networks, recommendation engines.</p>
</li>
<li><p><strong>Elasticsearch:</strong> Powerful search and analytics, good for log analysis, full-text search.</p>
</li>
<li><p><strong>ClickHouse:</strong> High-performance analytics, good for large datasets and complex queries.</p>
</li>
</ul>
<h2 id="heading-caches">Caches</h2>
<p><strong>1. CPU Cache:</strong> Small, very fast memory within the CPU that stores frequently used instructions and data. (e.g., L1, L2, L3 caches)</p>
<p><strong>2. Disk Cache:</strong> Portion of RAM used to store frequently accessed data from the hard drive.</p>
<p><strong>3. Web Cache (Browser Cache):</strong> Stores website assets (images, scripts, etc.) locally on the user's browser to speed up page loading.</p>
<p><strong>4. Server-Side Cache:</strong> Caches data on the server to reduce database load and improve response times.</p>
<ul>
<li><p><strong>Object Cache:</strong> Stores specific data objects like database query results.</p>
</li>
<li><p><strong>Page Cache:</strong> Caches entire web pages or fragments of pages.</p>
</li>
<li><p><strong>Opcode Cache:</strong> Stores precompiled code to avoid redundant compilation.</p>
</li>
</ul>
<p><strong>5. CDN Cache:</strong> Stores copies of website content on servers geographically closer to users for faster delivery.</p>
<p><strong>6. DNS Cache:</strong> Stores DNS records locally to speed up website lookups.</p>
<p><strong>7. Distributed Cache:</strong> A cache that is spread across multiple servers, often used in large-scale systems. (e.g., Redis, Memcached)</p>
<h2 id="heading-microservices">Microservices</h2>
<p><strong>1. Microservice:</strong> A small, independent, and loosely coupled service that performs a specific business function.</p>
<p><strong>2. API Gateway:</strong> A single entry point for all clients, handling routing, authentication, and rate limiting for microservices.</p>
<p><strong>3. Service Discovery:</strong> A mechanism for microservices to locate and communicate with each other dynamically.</p>
<p><strong>4. Service Mesh:</strong> A dedicated infrastructure layer for managing communication between microservices, providing features like load balancing, service discovery, and security.</p>
<p><strong>5. Containerization:</strong> Packaging a microservice and its dependencies into a container (e.g., Docker) for easy deployment and portability.</p>
<p><strong>6. Orchestration:</strong> Automating the deployment, scaling, and management of microservices, often using tools like Kubernetes.</p>
<p><strong>7. Circuit Breaker:</strong> A pattern to prevent cascading failures by stopping requests to a failing service.</p>
<p><strong>8. Distributed Tracing:</strong> Tracking requests across multiple microservices to monitor performance and identify bottlenecks.</p>
<p><strong>9. Event-Driven Architecture:</strong> Microservices communicate through events, enabling asynchronous and decoupled interactions.</p>
<p><strong>10. CQRS (Command Query Responsibility Segregation):</strong> Separating read and write operations to improve performance and scalability.</p>
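<p>Of the patterns above, the circuit breaker is the easiest to demystify with a little code. The sketch below is illustrative only (real systems typically use a library such as Resilience4j): it opens after a threshold of consecutive failures and fails fast until a cooldown elapses.</p>

```java
import java.util.function.Supplier;

public class CircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    public synchronized <T> T call(Supplier<T> remoteCall, T fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < cooldownMillis;
        if (open) {
            return fallback;             // fail fast: do not hit the failing service
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;     // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();  // trip the breaker
            }
            return fallback;
        }
    }
}
```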
<h2 id="heading-message-queues">Message Queues</h2>
<ul>
<li><p><strong>Message Queue:</strong> A temporary storage buffer for messages waiting to be processed.</p>
</li>
<li><p><strong>Producer:</strong> An application that creates and sends messages to a queue.</p>
</li>
<li><p><strong>Consumer:</strong> An application that receives and processes messages from a queue.</p>
</li>
<li><p><strong>Queue Depth:</strong> The number of messages currently held in a queue.</p>
</li>
<li><p><strong>Message Broker:</strong> A middleware component that manages the flow of messages between producers and consumers. (e.g., RabbitMQ, Kafka)</p>
</li>
<li><p><strong>Acknowledgement (ACK):</strong> A signal from a consumer to the queue that a message has been successfully processed.</p>
</li>
<li><p><strong>Dead-letter Queue:</strong> A queue that stores messages that failed to be processed.</p>
</li>
<li><p><strong>Message Priority:</strong> Assigning different levels of importance to messages for processing order.</p>
</li>
<li><p><strong>Message Durability:</strong> Ensuring messages are not lost even if the message broker fails.</p>
</li>
<li><p><strong>Message Ordering:</strong> Guaranteeing messages are processed in the order they were sent.</p>
</li>
</ul>
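<p>The producer/consumer terms above can be illustrated with a small sketch that uses a JDK <code>BlockingQueue</code> as a stand-in for a message broker (a real deployment would use RabbitMQ, Kafka, or similar; the class name is made up):</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    /** Runs one producer and one consumer; returns messages in processing order. */
    public static List<String> run() throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10); // bounded queue depth
        List<String> processed = Collections.synchronizedList(new ArrayList<>());

        // Producer: creates and sends messages to the queue.
        Thread producer = new Thread(() -> {
            try {
                queue.put("order-created");
                queue.put("order-paid");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: receives and processes messages from the queue.
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 2; i++) {
                    processed.add(queue.take()); // blocks until a message arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return processed;
    }
}
```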
]]></content:encoded></item><item><title><![CDATA[Master Protocol Buffers for System Design]]></title><description><![CDATA[In system design interviews high level design is relatively simpler if you know the basic building blocks. For low level design you are expected to design the API interfaces. How would you do that ? A lot of people either like to talk in terms of RES...]]></description><link>https://blog.principle-ai.com/master-protocol-buffers-for-system-design</link><guid isPermaLink="true">https://blog.principle-ai.com/master-protocol-buffers-for-system-design</guid><category><![CDATA[protocol-buffers]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Thu, 02 Jan 2025 00:36:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735775282629/d460ea40-e478-452f-88df-5dbee962a355.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In system design interviews high level design is relatively simpler if you know the basic building blocks. For low level design you are expected to design the API interfaces. How would you do that ? A lot of people either like to talk in terms of REST apis or a programming language specific APIs.</p>
<p>However, I recommend that candidates master Protocol Buffers as a way to describe API interfaces. Even if you plan to use REST and not gRPC, it is trivial to create an adapter around gRPC to support REST interfaces. But why are Protocol Buffers so important?</p>
<h2 id="heading-protocol-buffers-are-typed-json-is-not">Protocol Buffers are typed; JSON is not</h2>
<p>If you are using JSON to describe your API inputs and outputs, you will run into the problem of describing the types to the interviewer.</p>
<p>For example, consider a photo object.</p>
<pre><code class="lang-json"><span class="hljs-comment">// Photo Object</span>
{
  fileName: <span class="hljs-string">""</span>,
  contentType: <span class="hljs-string">""</span>, 
  base64EncodedImage: <span class="hljs-string">""</span>,
  author: { .... }
}

<span class="hljs-comment">// Video object </span>
{
  fileName: <span class="hljs-string">""</span>,
  contentType: <span class="hljs-string">""</span>, 
  base64EncodedVideo: <span class="hljs-string">""</span>,
  author: { .... }
}
</code></pre>
<p>The problem here is: how do you communicate the types of the different fields? What exactly is the author object? And how do you represent the fact that the Author object is the same for both Video and Photo?</p>
<p>Compare this with the Protocol Buffers version:</p>
<pre><code class="lang-plaintext">message Photo {
  string fileName = 1; 
  ContentType contentType = 2; 
  bytes imageData = 3; 
  Author author = 4;
}

message Video {
  string fileName = 1; 
  ContentType contentType = 2; 
  bytes videoData = 3; 
  Author author = 4;
}

message Author {
 int64 userId = 1; 
 repeated Role roles = 2; 
}

enum ContentType {
  CONTENT_TYPE_UNSPECIFIED = 0; 
  BINARY_PNG = 1; 
  BINARY_MP4 = 2; 
}
</code></pre>
<p>Not only is this extremely concise, it also answers a lot of questions about the types and the reusability of objects. Protocol Buffers are also a good fit for storage because they serialize to a compact binary format.</p>
<p>If you propose to use Cloud Spanner database in GCP, then you can actually describe your database schema using protocol buffers as well.</p>
<p>Similarly, if you want to describe API interfaces it is much better to describe them as gRPC interfaces using protocol buffers instead of REST.</p>
<pre><code class="lang-plaintext">syntax = "proto3";

service MultiMediaService {
  rpc UploadPhoto (stream UploadPhotoRequest) returns (UploadPhotoResponse) {}
  rpc UploadVideo (stream UploadVideoRequest) returns (UploadVideoResponse) {}
}

message UploadPhotoRequest {
  bytes chunk_data = 1; // Chunk of the photo data
  string file_name = 2; // Name of the photo file
  string content_type = 3; // MIME type of the photo (e.g., "image/jpeg")
}

message UploadPhotoResponse {
  string photo_id = 1; // Unique ID for the uploaded photo
  string photo_url = 2; // URL to access the photo
}

message UploadVideoRequest {
  bytes chunk_data = 1; // Chunk of the video data
  string file_name = 2; // Name of the video file
  string content_type = 3; // MIME type of the video (e.g., "video/mp4")
}

message UploadVideoResponse {
  string video_id = 1; // Unique ID for the uploaded video
  string video_url = 2; // URL to access the video
}
</code></pre>
<p>This avoids the problem of describing HTTP endpoints, proper HTTP methods, and JSON inputs and outputs. Not only is this a better design, it also makes the system design interview more efficient by saving time.</p>
<h2 id="heading-pitfalls">Pitfalls</h2>
<p>However, it is always a good idea to ask your interviewer whether they have a preference between JSON and Protocol Buffers. If the interviewer is not familiar with Protocol Buffers, it might only confuse them. In that case, stick to JSON.</p>
]]></content:encoded></item><item><title><![CDATA[Design a centralized authentication system]]></title><description><![CDATA[https://www.youtube.com/watch?v=Dh11cxb7WYs
 
Problem statement:
Design a centralized authentication system for millions of users. The objective is to only design authentication system and not authorization system. More than 1M daily active users and...]]></description><link>https://blog.principle-ai.com/design-a-centralized-authentication-system</link><guid isPermaLink="true">https://blog.principle-ai.com/design-a-centralized-authentication-system</guid><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Mon, 30 Dec 2024 07:10:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735542629741/c3e75db5-b9bf-4aad-83e8-962118eb5643.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=Dh11cxb7WYs">https://www.youtube.com/watch?v=Dh11cxb7WYs</a></div>
<p> </p>
<h2 id="heading-problem-statement">Problem statement:</h2>
<p>Design a centralized authentication system for millions of users. The objective is to design only the authentication system, not the authorization system. The system must support more than 1M daily active users and 100M calls to verify authentication tokens.</p>
<h2 id="heading-solution-summary">Solution Summary:</h2>
<p>We use simple JWT-based authentication. An AuthenticationService handles the most complex operations. To help with scaling, we use an in-memory hot cache that keeps all the active JWTs. A CryptoService signs and verifies tokens, and the cache reduces calls to this service.</p>
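<p>To make the token flow concrete, here is a hedged sketch of HS256 JWT signing and verification using only the JDK. The class name is hypothetical; a production system would use a vetted JWT library, validate claims such as expiry, and prefer asymmetric keys so the verification tier never holds signing keys.</p>

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class JwtSketch {
    private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

    /** Produces header.payload.signature, signed with HMAC-SHA256. */
    public static String sign(String payloadJson, byte[] key) throws Exception {
        String header = B64.encodeToString(
                "{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = B64.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        String signature = B64.encodeToString(
                mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8)));
        return signingInput + "." + signature;
    }

    /** Recomputes the signature and compares it in constant time. */
    public static boolean verify(String token, byte[] key) throws Exception {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) return false;
        String signingInput = token.substring(0, lastDot);
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        String expected = B64.encodeToString(
                mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8)));
        return java.security.MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                token.substring(lastDot + 1).getBytes(StandardCharsets.UTF_8));
    }
}
```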
]]></content:encoded></item><item><title><![CDATA[Adding LLM to Spring Boot : Start using AI today]]></title><description><![CDATA[Web developers are curious about how to integrate the power of Large Language Models (LLMs) into their projects. It's becoming increasingly clear that LLMs will play an important role in shaping user experiences across all kinds of apps. However, a l...]]></description><link>https://blog.principle-ai.com/adding-llm-to-spring-boot-start-using-ai-today</link><guid isPermaLink="true">https://blog.principle-ai.com/adding-llm-to-spring-boot-start-using-ai-today</guid><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Mon, 30 Dec 2024 02:34:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735493359906/e6d2e8e7-55d5-4bdf-8e1e-95637bb8ce97.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Web developers are curious about how to integrate the power of Large Language Models (LLMs) into their projects. It's becoming increasingly clear that LLMs will play an important role in shaping user experiences across all kinds of apps. However, a lot of engineers also do not where to start and how to start experimenting.</p>
<p>In this article we will provide simple steps and code to start using open-source LLMs in your app. Before you dive in, please watch our <a target="_blank" href="https://www.youtube.com/watch?v=NUlLDY16QZU&amp;t=1s">high level design video that describes how AI can be used in your app</a>.</p>
<h2 id="heading-ollama">Ollama</h2>
<p>Ollama is an open-source tool that makes it simple to run any open-source model on your server. Open-source models such as Meta’s Llama and Google’s Gemma are essentially very large files that you load into memory and query through their own APIs. Each open-source model has a different interface and different capabilities.</p>
<p>As an application developer, it is best not to concern yourself with the low-level details of the model APIs. However, you might want to experiment with different models to figure out which one best fits your needs.</p>
<p>Ollama is a wrapper around these open-source models that lets you pick and choose a model while giving you a uniform interface around it.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Model</td><td>Parameters</td><td>Size</td><td>Download</td></tr>
</thead>
<tbody>
<tr>
<td>Llama 3.3</td><td>70B</td><td>43GB</td><td><code>ollama run llama3.3</code></td></tr>
<tr>
<td>Llama 3.2</td><td>3B</td><td>2.0GB</td><td><code>ollama run llama3.2</code></td></tr>
<tr>
<td>Llama 3.2</td><td>1B</td><td>1.3GB</td><td><code>ollama run llama3.2:1b</code></td></tr>
<tr>
<td>Llama 3.2 Vision</td><td>11B</td><td>7.9GB</td><td><code>ollama run llama3.2-vision</code></td></tr>
<tr>
<td>Llama 3.2 Vision</td><td>90B</td><td>55GB</td><td><code>ollama run llama3.2-vision:90b</code></td></tr>
<tr>
<td>Llama 3.1</td><td>8B</td><td>4.7GB</td><td><code>ollama run llama3.1</code></td></tr>
<tr>
<td>Llama 3.1</td><td>405B</td><td>231GB</td><td><code>ollama run llama3.1:405b</code></td></tr>
<tr>
<td>Phi 3 Mini</td><td>3.8B</td><td>2.3GB</td><td><code>ollama run phi3</code></td></tr>
<tr>
<td>Phi 3 Medium</td><td>14B</td><td>7.9GB</td><td><code>ollama run phi3:medium</code></td></tr>
<tr>
<td>Gemma 2</td><td>2B</td><td>1.6GB</td><td><code>ollama run gemma2:2b</code></td></tr>
<tr>
<td>Gemma 2</td><td>9B</td><td>5.5GB</td><td><code>ollama run gemma2</code></td></tr>
<tr>
<td>Gemma 2</td><td>27B</td><td>16GB</td><td><code>ollama run gemma2:27b</code></td></tr>
<tr>
<td>Mistral</td><td>7B</td><td>4.1GB</td><td><code>ollama run mistral</code></td></tr>
<tr>
<td>Moondream 2</td><td>1.4B</td><td>829MB</td><td><code>ollama run moondream</code></td></tr>
<tr>
<td>Neural Chat</td><td>7B</td><td>4.1GB</td><td><code>ollama run neural-chat</code></td></tr>
<tr>
<td>Starling</td><td>7B</td><td>4.1GB</td><td><code>ollama run starling-lm</code></td></tr>
<tr>
<td>Code Llama</td><td>7B</td><td>3.8GB</td><td><code>ollama run codellama</code></td></tr>
<tr>
<td>Llama 2 Uncensored</td><td>7B</td><td>3.8GB</td><td><code>ollama run llama2-uncensored</code></td></tr>
<tr>
<td>LLaVA</td><td>7B</td><td>4.5GB</td><td><code>ollama run llava</code></td></tr>
<tr>
<td>Solar</td><td>10.7B</td><td>6.1GB</td><td><code>ollama run solar</code></td></tr>
</tbody>
</table>
</div><p>This gives you an idea of how useful Ollama is.</p>
<h2 id="heading-using-ollama-with-spring-boot">Using Ollama with Spring Boot</h2>
<p>Ollama is installed as a command line tool. Once installed, you can use it to start a server that exposes a REST interface.</p>
<p>Install ollama</p>
<pre><code class="lang-plaintext">curl -fsSL https://ollama.com/install.sh | sh
</code></pre>
<p>Start the ollama server</p>
<pre><code class="lang-plaintext">./ollama serve

# In a separate shell

./ollama run llama3.2
</code></pre>
<p>You can then query this server.</p>
<pre><code class="lang-plaintext">curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt":"Why is the sky blue?"
}'
</code></pre>
<h2 id="heading-integrating-with-spring-boot">Integrating with Spring Boot</h2>
<p>To use Ollama with Spring Boot, you should run Ollama on a separate server and use its REST interface to make your Spring Boot application talk to it.</p>
<p>Spring AI provides a helpful starter that lets you interface with an Ollama server. Just add the following dependency to your build.gradle file.</p>
<pre><code class="lang-plaintext">dependencies {
    implementation 'org.springframework.ai:spring-ai-ollama-spring-boot-starter'
}
</code></pre>
<p>Once you add this dependency, you can configure the Ollama parameters directly in your Spring Boot application.</p>
<p>Configuring Ollama parameters.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Property</td><td>Description</td><td>Default</td></tr>
</thead>
<tbody>
<tr>
<td>spring.ai.ollama.base-url</td><td>Base URL where Ollama API server is running.</td><td><a target="_blank" href="http://localhost:11434/"><code>localhost:11434</code></a></td></tr>
</tbody>
</table>
</div><p>Then you can define a custom controller that calls the Ollama chat API.</p>
<pre><code class="lang-java"><span class="hljs-meta">@RestController</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ChatController</span> </span>{

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> OllamaChatModel chatModel;

    <span class="hljs-meta">@Autowired</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">ChatController</span><span class="hljs-params">(OllamaChatModel chatModel)</span> </span>{
        <span class="hljs-keyword">this</span>.chatModel = chatModel;
    }

    <span class="hljs-meta">@GetMapping("/ai/generate")</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Map&lt;String,String&gt; <span class="hljs-title">generate</span><span class="hljs-params">(<span class="hljs-meta">@RequestParam(value = "message", defaultValue = "Tell me a joke")</span> String message)</span> </span>{
        <span class="hljs-keyword">return</span> Map.of(<span class="hljs-string">"generation"</span>, <span class="hljs-keyword">this</span>.chatModel.call(message));
    }

    <span class="hljs-meta">@GetMapping("/ai/generateStream")</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Flux&lt;ChatResponse&gt; <span class="hljs-title">generateStream</span><span class="hljs-params">(<span class="hljs-meta">@RequestParam(value = "message", defaultValue = "Tell me a joke")</span> String message)</span> </span>{
        Prompt prompt = <span class="hljs-keyword">new</span> Prompt(<span class="hljs-keyword">new</span> UserMessage(message));
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">this</span>.chatModel.stream(prompt);
    }

}
</code></pre>
<p>OllamaChatModel is the standard API for chatting with the model, and UserMessage represents the prompt that comes from the user. We will cover the actual usage of these APIs in future posts, but this gives you a good starting point.</p>
<h2 id="heading-function-calling-with-ollama">Function calling with Ollama</h2>
<p>With tool support, the LLM can trigger methods in your application based on its output. For example, you might want to ask the LLM for the current temperature in a city and present it to the user. To answer, the LLM may need to call a method inside your app. One way to do this is to register a function and let the LLM know it is available; the LLM then decides when it needs to be called.</p>
<pre><code class="lang-java"><span class="hljs-meta">@SpringBootApplication</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OllamaApplication</span> </span>{

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        SpringApplication.run(OllamaApplication.class, args);
    }

    <span class="hljs-meta">@Bean</span>
    <span class="hljs-function">CommandLineRunner <span class="hljs-title">runner</span><span class="hljs-params">(ChatClient.Builder chatClientBuilder)</span> </span>{
        <span class="hljs-keyword">return</span> args -&gt; {
            <span class="hljs-keyword">var</span> chatClient = chatClientBuilder.build();

            <span class="hljs-keyword">var</span> response = chatClient.prompt()
                .user(<span class="hljs-string">"What is the weather in Amsterdam and Paris?"</span>)
                .functions(<span class="hljs-string">"weatherFunction"</span>) <span class="hljs-comment">// reference by bean name.</span>
                .call()
                .content();

            System.out.println(response);
        };
    }

    <span class="hljs-meta">@Bean</span>
    <span class="hljs-meta">@Description("Get the weather in location")</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> Function&lt;WeatherRequest, WeatherResponse&gt; <span class="hljs-title">weatherFunction</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> MockWeatherService();
    }

    <span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MockWeatherService</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Function</span>&lt;<span class="hljs-title">WeatherRequest</span>, <span class="hljs-title">WeatherResponse</span>&gt; </span>{

        <span class="hljs-function"><span class="hljs-keyword">public</span> record <span class="hljs-title">WeatherRequest</span><span class="hljs-params">(String location, String unit)</span> </span>{}
        <span class="hljs-function"><span class="hljs-keyword">public</span> record <span class="hljs-title">WeatherResponse</span><span class="hljs-params">(<span class="hljs-keyword">double</span> temp, String unit)</span> </span>{}

        <span class="hljs-meta">@Override</span>
        <span class="hljs-function"><span class="hljs-keyword">public</span> WeatherResponse <span class="hljs-title">apply</span><span class="hljs-params">(WeatherRequest request)</span> </span>{
            <span class="hljs-keyword">double</span> temperature = request.location().contains(<span class="hljs-string">"Amsterdam"</span>) ? <span class="hljs-number">20</span> : <span class="hljs-number">25</span>;
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> WeatherResponse(temperature, request.unit());
        }
    }
}
</code></pre>
<p>Source : <a target="_blank" href="https://spring.io/blog/2024/07/26/spring-ai-with-ollama-tool-support">https://spring.io/blog/2024/07/26/spring-ai-with-ollama-tool-support</a></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Spring Boot and Ollama play well together and make it remarkably simple to use AI in your apps.</p>
]]></content:encoded></item><item><title><![CDATA[Integrating Gen-AI into your applications - Part 1 - The foundations]]></title><description><![CDATA[Integrating generative AI into your existing applications is one of the major challenges of all businesses. Generative AI has a lot of capabilities and it has simplified how we can use machine learning solutions in usual business workflows. In past t...]]></description><link>https://blog.principle-ai.com/integrating-gen-ai-into-your-applications-part-1-the-foundations</link><guid isPermaLink="true">https://blog.principle-ai.com/integrating-gen-ai-into-your-applications-part-1-the-foundations</guid><category><![CDATA[generative ai]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Sat, 21 Dec 2024 10:02:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734772547284/dc6ca83e-3816-4c27-96a0-f46f06edf03b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://aiauthority.dev/system-design-how-to-start-adding-generative-ai-to-your-product">Integrating generative AI into your existing applications</a> is one of the major challenges businesses face. Generative AI has a lot of capabilities, and it has simplified how we can use machine learning solutions in everyday business workflows. In the past this was extremely complex, as it involved training different models on different data to solve specific problems. In this article series we are going to explore this problem in more detail with several concrete use cases.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">If you find this series useful, you can subscribe to our newsletter and YouTube channel to show your appreciation.</div>
</div>

<h2 id="heading-the-foundations">The foundations</h2>
<p>Generative AI could mean many things. Large Language Models (LLMs) are the most basic type of generative AI technology, but we might also use something like Imagen, Midjourney, Sora, or Veo. Very likely, though, we are going to use an LLM for most generative AI tasks, so we will focus on LLM use cases first.</p>
<p>Once you have decided that you need GenAI in your application, the next job is for the product team to figure out the areas where GenAI can make your product better and more useful. We will create a more structured framework for this later, but first we need an engineering foundation that different product teams can rely on to meet their needs.</p>
<h2 id="heading-llm-service">LLM service</h2>
<p>Our advice to organizations is to have a single entry point into all your Gen AI related work. This means a single team is responsible for maintaining the Gen AI inference infrastructure in the organization and owns a generic set of inference APIs that other teams can use.</p>
<p>This is sensible because LLMs themselves have an extremely simple interface: they take free-form text as input and produce text as output. There is really not much more to it.</p>
<p>An LLM service is responsible for running inference for everyone against the models hosted by this service. Such a microservice is maintained by a single team and provides service to multiple other teams in the organization.</p>
<p>Experimentation, analytics and AI safety can thus be implemented at this entry point more uniformly. Also, given that inference is expensive, we might want to assign quotas to each consumer depending on business need.</p>
<p>Even if you are using third-party APIs like Google’s Gemini, you still need to control access to such a system and standardize how other teams use these GenAI technologies. So it makes sense to create an LLM wrapper around it.</p>
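<p>To make this concrete, here is a minimal sketch of such a wrapper in plain Java. All names here are hypothetical, not a real library API; it only illustrates the idea of a single entry point where quotas, logging and safety checks can live.</p>

```java
// Hypothetical sketch of a thin LLM service wrapper: one entry point,
// so per-team quotas, logging and safety checks live in a single place.
import java.util.HashMap;
import java.util.Map;

interface LlmBackend {
    String complete(String prompt); // e.g. a call to a hosted model API
}

class LlmService {
    private final LlmBackend backend;
    private final Map<String, Integer> quotaUsed = new HashMap<>();

    LlmService(LlmBackend backend) { this.backend = backend; }

    String complete(String team, String prompt) {
        int used = quotaUsed.merge(team, 1, Integer::sum);
        if (used > 100) { // arbitrary per-team quota for illustration
            throw new IllegalStateException("quota exceeded for " + team);
        }
        // central place to log the interaction and apply safety filters
        return backend.complete(prompt);
    }
}
```

<p>Consumer teams depend only on this wrapper, so swapping the backing model (self-hosted or a third-party API) never touches their code.</p>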
<h2 id="heading-session-storage">Session storage</h2>
<p>LLMs are all about context, and hence we might also want to store some contextual data alongside an LLM prompt. This can be done using a concept called a session. It is especially useful if you are building a conversational AI and the LLM needs access to the previous prompts and outputs. Hence we need a good way for consumers to inform the LLM about the context of the query.</p>
<p>This can be modeled the way we model a session in typical web applications: each consumer “creates” and “ends” a session, and the backend service chains together all the events that happen under that session ID.</p>
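<p>A minimal sketch of such a session store (hypothetical names, in-memory only; a real service would persist this):</p>

```java
// Hypothetical in-memory session store: each session chains the
// conversation turns so the LLM can be given the prior context.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

class SessionStore {
    private final Map<String, List<String>> sessions = new HashMap<>();

    // Consumer "creates" a session and gets an ID to chain events under.
    String create() {
        String id = UUID.randomUUID().toString();
        sessions.put(id, new ArrayList<>());
        return id;
    }

    void append(String id, String turn) {
        sessions.get(id).add(turn);
    }

    // Everything said so far, joined into one context block for the prompt.
    String context(String id) {
        return String.join("\n", sessions.get(id));
    }

    // Consumer "ends" the session when the conversation is over.
    void end(String id) {
        sessions.remove(id);
    }
}
```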
<h2 id="heading-analytics-and-logging">Analytics and logging</h2>
<p>Since Gen-AI technologies are experimental, we need a lot of real-time data about how our Gen AI solutions are doing in the wild. We can achieve this by logging all Gen AI interactions to a central place where they can be analyzed at a very high level of granularity by our researchers. This data can also be used to fine-tune future models.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We have put together a video with more details of this sort of LLM Service.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=NUlLDY16QZU">https://www.youtube.com/watch?v=NUlLDY16QZU</a></div>
<p> </p>
<p><strong>Next Part: Data collection, fine tuning and experimentation.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Url shortening service on budget]]></title><description><![CDATA[There are lot of different system design answers for a url shortening service at scale. However, if you are on budget and you want to create something that is simple, cheap and low maintenance here is an interesting one.
We use a simple API gateway f...]]></description><link>https://blog.principle-ai.com/url-shortening-service-on-budget</link><guid isPermaLink="true">https://blog.principle-ai.com/url-shortening-service-on-budget</guid><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Wed, 18 Dec 2024 04:16:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734495355345/05afea6f-5f0e-4b13-b448-e324945cd31c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are a lot of different system design answers for a URL shortening service at scale. However, if you are on a budget and want to create something simple, cheap, and low-maintenance, here is an interesting one.</p>
<p>We use a simple API gateway for creating the short URL. A URL is hashed to a URL-safe short string, and an Amazon AWS S3 object is created. S3 allows you to set metadata on every S3 object, including a 302 redirect URL to which requests for a public object are redirected.</p>
<p>S3 objects can be fronted using CloudFront CDN which makes everything global.</p>
<p>We create a bucket “mydomain-urls” and for every new URL we create an empty object in the bucket.</p>
<p>mydomain-urls/&lt;hash&gt; and set its metadata to the 302 redirect URL.</p>
<p>CloudFront can then be set up in front of this bucket.</p>
<p>We also need to set up a 404 page for the bucket.</p>
<p><img src="https://documents.lucid.app/documents/86df232e-d632-40f7-87df-d8d8ead8e60f/pages/0_0?a=5709&amp;x=-2453&amp;y=719&amp;w=1581&amp;h=880&amp;store=1&amp;accept=image%2F*&amp;auth=LCA%20a201f5b1d3e64b8cee58439cd999b53199017b0d7262de9ce4bfdf483d840198-ts%3D1734491178" alt /></p>
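<p>The write path hinges on the hashing step. A small sketch of deriving a URL-safe key in Java — the key length and alphabet are our own choices here, not part of the design above:</p>

```java
// Sketch of the hashing step: derive a short, URL-safe object key
// from the long URL. 7 chars over a 62-symbol alphabet gives 62^7
// possible keys, so collisions are unlikely at modest scale.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ShortKey {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static String of(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < 7; i++) {
                // map each digest byte onto the URL-safe alphabet
                key.append(ALPHABET.charAt((digest[i] & 0xFF) % 62));
            }
            return key.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present
        }
    }
}
```

<p>The resulting key becomes the S3 object name under mydomain-urls/, with the redirect metadata set on that object.</p>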
<p><strong>Advantages</strong></p>
<ol>
<li><p>No database required.</p>
</li>
<li><p>No hot cache/redis or multi region deployments needed. (This is offloaded to cloudfront)</p>
</li>
<li><p>One single server is sufficient for a reasonable load.</p>
</li>
<li><p>The “read” flow is pretty much handled for us by CloudFront, which has a very good SLA.</p>
</li>
<li><p>CloudFront provides both rate limiting and DDoS protection.</p>
</li>
</ol>
<p><strong>Disadvantages</strong></p>
<ol>
<li>No real-time stats about requests are available out of the box, but they can be derived through log processing.</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Saga Pattern - Architecture guide]]></title><description><![CDATA[Microservices driven architecture is getting increasingly common. Microservices are small independent services that do one thing and one thing right with very clear API boundaries. A complex system is built by using multiple microservices calling eac...]]></description><link>https://blog.principle-ai.com/saga-pattern-architecture-guide</link><guid isPermaLink="true">https://blog.principle-ai.com/saga-pattern-architecture-guide</guid><category><![CDATA[saga]]></category><category><![CDATA[System Architecture]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Tue, 17 Dec 2024 08:00:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734377082073/39e86f5c-c7fc-45d1-a896-d471a8446370.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Microservices driven architecture is getting increasingly common. Microservices are small independent services that do one thing and one thing right with very clear API boundaries. A complex system is built by using multiple microservices calling each other. I think of this as lego bricks being used to build a more complex object.</p>
<p>This approach allows for better people management. Small teams can work on their independent code bases without worrying too much about other teams. This reduces the need to synchronize with other teams and the communication overhead.</p>
<p>The disadvantages of microservices are that they might lead to inefficient allocation of compute power, increased RPC communication overhead, and harder debugging, as a bug in one system might propagate to its callers.</p>
<h2 id="heading-saga-pattern">Saga Pattern</h2>
<p>In this post we are going to discuss the architectural pattern called “Saga”. Saga in English means a long story of heroic achievement. Modern technology-driven businesses are often like this: long chains of events leading to different outcomes. For example, a user searching for flight tickets, then choosing one, booking it, making a payment, and later cancelling the flight.</p>
<p>The Saga pattern is designed for exactly such complex scenarios: it breaks them down into smaller pieces and helps you design small, self-contained microservices.</p>
<p><img src="https://documents.lucid.app/documents/86df232e-d632-40f7-87df-d8d8ead8e60f/pages/0_0?a=4387&amp;x=-2255&amp;y=-1213&amp;w=1190&amp;h=706&amp;store=1&amp;accept=image%2F*&amp;auth=LCA%207ec6b26a189fc94e22e8a884b4cde062f2af239306759f4366b3308a9069cf90-ts%3D1734375546" alt /></p>
<p>From the user’s perspective, this is a single journey they are taking. From the system’s perspective, it is one of many possible paths.</p>
<p>A typical travel website supports millions of searches per day, but only thousands of tickets get booked each day. Of those thousands, a few hundred will eventually get cancelled. So the scalability needs of each of these steps are very different.</p>
<p>Similarly, there are different reasons to send out emails to the customer: booking emails, payment emails, flight itinerary change emails, etc.</p>
<p>To organize these steps, the Saga pattern has two subtypes.</p>
<h2 id="heading-choreographed-saga">Choreographed Saga</h2>
<p>This phrase comes from the world of dance. A dance is a series of steps that each participant takes. Every dancer does their own moves and trusts their partner to do theirs. There is no centralized coordinator forcing them to make specific moves; it is all determined by convention.</p>
<p>In microservices, a choreographed saga has each microservice doing its work and reporting its result on a particular communication channel. If the microservice fails to do its job, it reports the failure as well. Any other microservice that depends on it is responsible for taking the next steps based on the success or failure result.</p>
<p>For example, the booking system might record the user’s desire to book a ticket, hold the ticket, and ask the user to complete the payment. The payment system reports the success or failure result on a channel that the booking system listens to, and the booking system takes the next steps accordingly. The failure-path handling is often called “compensatory actions”.</p>
<p><img src="https://documents.lucid.app/documents/86df232e-d632-40f7-87df-d8d8ead8e60f/pages/0_0?a=4483&amp;x=-2206&amp;y=-1417&amp;w=2313&amp;h=794&amp;store=1&amp;accept=image%2F*&amp;auth=LCA%207681d6d8cc783c3765d73e92e13b74dbbc2e6778b73c8756606c5409435c9bec-ts%3D1734375546" alt /></p>
<p>Here the booking system asks the payment system to process a payment and respond on the reply channel. If the payment system fails to process the payment, the booking system will remove the ticket hold. If the payment succeeds, the booking system will mark the ticket as sold.</p>
<p>What if the payment succeeds but the booking system fails to mark the ticket as bought? The booking system is then responsible for issuing a refund request to the payment system.</p>
<p>How the payment system handles payments or refunds, what emails it might send to the user, etc., is not determined by the booking system. It only sees the payment system as two microservice calls: “ProcessPayment” and “RefundPayment”.</p>
<p>Similarly, the payment system does not care why the payment is being made. It only does the two things it is asked to do.</p>
<p>This reduces complexity and makes things much simpler to engineer.</p>
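<p>The booking/payment flow above can be sketched as a toy choreography. The names are hypothetical and the channel is an in-memory queue standing in for a real durable message broker:</p>

```java
// Toy choreography: the booking service reacts to payment results on a
// reply channel; the failure path triggers the compensatory action.
import java.util.ArrayDeque;
import java.util.Queue;

class BookingSaga {
    enum Ticket { HELD, SOLD, RELEASED }

    Ticket ticket = Ticket.HELD;                 // hold placed up front
    final Queue<String> replyChannel = new ArrayDeque<>();

    // The payment service publishes its result; it knows nothing
    // about what the booking system will do with it.
    void paymentResult(boolean success) {
        replyChannel.add(success ? "PAYMENT_OK" : "PAYMENT_FAILED");
    }

    // The booking service consumes replies and takes the next step itself.
    void drain() {
        String event;
        while ((event = replyChannel.poll()) != null) {
            if (event.equals("PAYMENT_OK")) {
                ticket = Ticket.SOLD;
            } else {
                ticket = Ticket.RELEASED;        // compensatory action
            }
        }
    }
}
```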
<p><strong>Pros:</strong></p>
<ol>
<li><p>Good scalability as there are no centralized systems.</p>
</li>
<li><p>Development and testing can be independently done.</p>
</li>
</ol>
<p><strong>Cons:</strong></p>
<ol>
<li><p>Changes to the communication protocol are harder.</p>
</li>
<li><p>Bugs can have a domino effect.</p>
</li>
<li><p>Communication design is complex.</p>
</li>
</ol>
<h2 id="heading-conductor-saga">Conductor Saga</h2>
<p>This phrase comes from the world of music where a conductor instructs everyone on what to do. This conductor is a centralized broker who knows all the steps and their logic and is responsible for actions. This makes it easier for engineers to code the logic of flows in a single system.</p>
<p><img src="https://documents.lucid.app/documents/86df232e-d632-40f7-87df-d8d8ead8e60f/pages/0_0?a=4548&amp;x=-2017&amp;y=-783&amp;w=794&amp;h=486&amp;store=1&amp;accept=image%2F*&amp;auth=LCA%20dc23cf68cd30de70689287f279b05b96e90103d9ea78c6ab7f28983e1f652ca6-ts%3D1734375546" alt /></p>
<p>In this system the broker tells the booking system to hold a ticket, and it asks the payment system to process a payment. The booking system does not know about the existence of the payment system. The broker checks the result of the payment and asks the booking system to do whatever needs to be done next. The broker also decides what email communication needs to be sent out.</p>
<p>The conductor pattern is useful when the flow logic changes often and is far too critical to be left to distributed systems to implement correctly.</p>
<p><strong>Pros:</strong></p>
<ol>
<li><p>You have all the flow logic in a single place</p>
</li>
<li><p>Debugging is easier</p>
</li>
<li><p>Bug detection is easier</p>
</li>
</ol>
<p><strong>Cons:</strong></p>
<ol>
<li><p>Conductor is a single point of failure</p>
</li>
<li><p>Conductor code gets complex with time</p>
</li>
<li><p>Scalability has challenges</p>
</li>
</ol>
<h2 id="heading-why-use-saga-at-all">Why use Saga at all?</h2>
<p>Nearly all modern highly scalable systems end up using the Saga pattern in some form for at least some of their problems. Saga’s benefit is that you do not have to rely on any long-held locks, and you can scale or rearchitect each system independently while keeping them loosely coupled.</p>
<p>Modern business needs are often a good justification as businesses often are run as complex long running processes involving multiple events.</p>
<p>To give you more context we have a deep dive into Saga pattern here.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=zu9aFuJ0iNQ">https://www.youtube.com/watch?v=zu9aFuJ0iNQ</a></div>
]]></content:encoded></item><item><title><![CDATA[Zero shot vs multi shot prompting?]]></title><description><![CDATA[One way to build classifiers in past was to provide a large number of examples to the model and let model train through some kind of training algorithm. Such training is complex and time consuming.
Large language models are massive and have learned a...]]></description><link>https://blog.principle-ai.com/zero-shot-vs-multi-shot-prompting</link><guid isPermaLink="true">https://blog.principle-ai.com/zero-shot-vs-multi-shot-prompting</guid><category><![CDATA[AI]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Sun, 15 Dec 2024 17:29:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734283741290/15142850-496e-4216-b3c2-1008fbf933f1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One way to build classifiers in the past was to provide a large number of examples to the model and let the model train through some kind of training algorithm. Such training is complex and time-consuming.</p>
<p>Large language models are massive and have learned a lot of patterns. This can be leveraged to simplify the process of training a new model. Instead of training a new model from scratch, you can give an LLM some examples on the fly, and the model is able to learn the patterns from those few examples. The speed comes from the fact that LLMs are trained on massive amounts of data and have learned the “meta skill” of learning from patterns.</p>
<pre><code class="lang-plaintext">
Given these examples

Jon,13 to { "name": "Jon", "age":13 }

Emma,9 to { "name": "Emma", "age":9 }

Convert

Roger, 14
</code></pre>
<p>Nearly any LLM today can learn from this and give you the following answer.  </p>
<pre><code class="lang-plaintext">{ "name": "Roger", "age": 14 }
</code></pre>
<p>Doing the same with traditional models would be quite hard for open-ended JSON parsing.</p>
<p>Even though the example of multi-shot prompting here is simplistic, this works well even for more complex systems, including cases where humans themselves might not be able to see the real conversion logic.</p>
<p>Models like Gemini can support up to 1M tokens of context. This is a very large context window and allows for many multi-shot examples.</p>
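<p>In an application, such few-shot prompts are usually assembled programmatically rather than written by hand. A small sketch in Java, mirroring the CSV-to-JSON prompt above (the class and wording are illustrative, not a library API):</p>

```java
// Sketch of a few-shot prompt builder: examples are concatenated ahead
// of the new input, exactly like the CSV-to-JSON prompt above.
import java.util.LinkedHashMap;
import java.util.Map;

class FewShotPrompt {
    // examples maps each sample input to its desired output;
    // LinkedHashMap callers preserve the order of the examples.
    static String build(Map<String, String> examples, String input) {
        StringBuilder prompt = new StringBuilder("Given these examples\n\n");
        for (Map.Entry<String, String> e : examples.entrySet()) {
            prompt.append(e.getKey()).append(" to ").append(e.getValue()).append("\n\n");
        }
        return prompt.append("Convert\n\n").append(input).toString();
    }
}
```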
<blockquote>
<p>Zero-shot learning and few-shot learning represent two ends of the spectrum in LLM optimization. Zero-shot learning operates purely on the general knowledge acquired during pre-training, requiring no additional task-specific data. While this approach is advantageous for its efficiency and simplicity, it often struggles with tasks that require a nuanced understanding of domain-specific information. [<a target="_blank" href="https://medium.com/@zbabar/optimizing-large-language-models-using-multi-shot-learning-9ee9eb98709b#:~:text=By%20extending%20the%20principles%20of,powerful%20approach%20to%20improving%20model">source</a>]</p>
</blockquote>
<p>Zero-shot prompting relies only on the embedded logic of the LLM. It is not necessarily a bad idea to rely on it: since LLMs have been trained on a lot of existing data, they might do better without your examples for certain types of output.</p>
<pre><code class="lang-plaintext">Can you generate a shell command to find the string "foo"
and replace it with "bar" everywhere in a text file?
</code></pre>
<p>Output:</p>
<pre><code class="lang-plaintext">sed -i 's/foo/bar/g' your_file.txt
</code></pre>
<p>Or translation</p>
<pre><code class="lang-plaintext">Convert “hello” to Spanish.
</code></pre>
<p><a target="_blank" href="https://aiauthority.dev/one-shot-vs-multi-shot-prompting">Zero shot is applicable</a> when you don’t actually have examples to provide nor you want your model to be constrained by any specific examples.</p>
<h2 id="heading-applying-few-shot-prompting-in-real-world-applications">Applying few shot prompting in real world applications</h2>
<p>One great way to apply few-shot prompting in the real world is customer success scenarios. Given 1000 examples of recently resolved customer complaints, can we predict the solution to an incoming complaint, or match the customer with the right service representative?</p>
<p>Note that the choice of examples in few-shot prompting is of paramount importance. If you give wrong or irrelevant examples, your precision and recall go down, so you need to design your prompts carefully.</p>
<p>Another example is e-commerce product recommendations: the last 1000 customers’ purchases looked like this; if a person has the following 10 items in their cart, recommend another product they should add.</p>
<p>Historically we used collaborative filtering to solve such recommendation problems.</p>
<p><strong>References:</strong></p>
<ol>
<li><p><a target="_blank" href="https://aiauthority.dev/one-shot-vs-multi-shot-prompting">https://aiauthority.dev/one-shot-vs-multi-shot-prompting</a></p>
</li>
<li><p><a target="_blank" href="https://labelbox.com/guides/zero-shot-learning-few-shot-learning-fine-tuning/">https://labelbox.com/guides/zero-shot-learning-few-shot-learning-fine-tuning/</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Designing a notification system]]></title><description><![CDATA[This is an interesting system design problem. Often this question gets asked in the context of delivery and the answer to that is relatively simple. Delivering notifications involve calling Apple and Google endpoints. This is failure prone and hence ...]]></description><link>https://blog.principle-ai.com/designing-a-notification-system</link><guid isPermaLink="true">https://blog.principle-ai.com/designing-a-notification-system</guid><category><![CDATA[System Architecture]]></category><category><![CDATA[interview]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Sun, 15 Dec 2024 07:52:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734248670659/8677616a-e097-4c8b-a41c-51f0e9c41b86.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is an interesting system design problem. Often this question gets asked in the context of delivery, and the answer to that is relatively simple. Delivering notifications involves calling Apple and Google endpoints. This is failure-prone and hence should be done as retriable async jobs. One solution is just to put each notification into a message queue for each of the services and let simple workers call the Apple and Google endpoints to deliver them.</p>
<p>But this problem is not just about delivery but also about the design. In an e-commerce company, there are multiple types of notifications. One set is critical notifications such as OTPs, password recovery, payment info, delivery info, etc. Another set could be marketing notifications based on complex scenarios. Different teams might want to create different types of “campaigns” to send notifications through various channels. Doing A/B tests with these notifications should also be possible.</p>
<p>To solve this problem we propose a campaign management system where such teams can design campaigns and set targeting criteria based on ETL events and other properties. An evaluation service then matches users with campaigns and schedules notifications.</p>
<p>A rendering engine is responsible for converting templates into the actual payload.</p>
<p>Finally, a criteria engine re-checks that the targeting criteria still match right before final delivery.</p>
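<p>The rendering step can be as simple as placeholder substitution. A toy sketch in Java — the class name and the {{placeholder}} syntax are our own assumptions, not part of a specific templating library:</p>

```java
// Toy rendering engine: a campaign template plus user properties
// become the final notification payload.
import java.util.Map;

class Renderer {
    static String render(String template, Map<String, String> props) {
        String out = template;
        for (Map.Entry<String, String> e : props.entrySet()) {
            // naive substitution; a real engine would also handle
            // escaping, missing keys, and localization
            out = out.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return out;
    }
}
```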
<p>We have done an in-depth deep dive on this topic in the following video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=EN4u_QLi168">https://www.youtube.com/watch?v=EN4u_QLi168</a></div>
]]></content:encoded></item><item><title><![CDATA[Asynchronous Communication in system designs]]></title><description><![CDATA[Async communication is one of the most important aspect of any complex system. Nearly all system design questions I have encountered in the wild have used this pattern in some form.
To understand what async communication is you need to understand wha...]]></description><link>https://blog.principle-ai.com/asynchronous-communication-in-system-designs</link><guid isPermaLink="true">https://blog.principle-ai.com/asynchronous-communication-in-system-designs</guid><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Sat, 07 Dec 2024 22:40:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733610665183/ed8b1a27-9cb2-4a96-bd09-4314d7fa7d88.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Async communication is one of the most important aspects of any complex system. Nearly all system design questions I have encountered in the wild have used this pattern in some form.</p>
<p>To understand what async communication is, you need to understand what it is really useful for. Our systems often consist of flows that involve multiple steps: for example, fetch a sitemap, then crawl thousands of pages listed in it, then index those pages. Each step has a preceding step, but the steps are also independent in a sense: the first step just fetches the sitemap and does not care what happens after it, while the second step cares about the sitemap but not about how it was fetched.</p>
<p>In many synchronous flows, such as a money transfer, each step is tightly linked to the others. For example, you withdraw money from the sender’s account and then deposit it in the receiver’s account; together this is called a money movement transaction. The second step cannot happen unless the first step has happened, and the first step must be rolled back if the second step fails. Such flows are called synchronous flows.</p>
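<p>As a tiny sketch of why synchronous steps must succeed or fail together (the function and account layout here are made up for illustration):</p>

```python
def transfer(accounts: dict, sender: str, receiver: str, amount: int) -> None:
    """Synchronous flow: withdraw and deposit succeed or fail as one unit."""
    if accounts[sender] < amount:
        raise ValueError("insufficient funds")
    accounts[sender] -= amount           # step 1: withdraw
    try:
        accounts[receiver] += amount     # step 2: deposit
    except KeyError:
        accounts[sender] += amount       # roll back step 1 if step 2 fails
        raise

accounts = {"alice": 100, "bob": 50}
transfer(accounts, "alice", "bob", 30)
print(accounts)  # → {'alice': 70, 'bob': 80}
```

<p>In a real banking system this coupling is enforced with database transactions or a saga pattern rather than an in-memory rollback, but the invariant is the same: either both steps happen or neither does.</p>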
<p>Asynchronous communication is achieved using two primary architectural components: first, message queues, and second, the pub/sub interface. A pub/sub interface is ultimately built using message queues and can be seen as an abstraction on top of them.</p>
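<p>The sitemap example can be sketched as a producer/consumer pair, using Python's in-process <code>queue.Queue</code> as a stand-in for a real message broker (all names here are illustrative):</p>

```python
import queue
import threading

tasks = queue.Queue()
crawled = []

def producer():
    # Step 1: "fetch the sitemap" and enqueue page URLs.
    # It does not care who crawls them, or when.
    for url in ["/a", "/b", "/c"]:
        tasks.put(url)
    tasks.put(None)  # sentinel: no more work

def consumer():
    # Step 2: crawl pages as they arrive.
    # It never knows how the URLs were produced.
    while True:
        url = tasks.get()
        if url is None:
            break
        crawled.append(url)  # stand-in for actually crawling the page

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(crawled)  # → ['/a', '/b', '/c']
```

<p>The queue is what decouples the two steps: the producer can run ahead or fall behind without either side blocking on the other, which is exactly the property a broker such as a dedicated message queue service provides across machines.</p>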
<p>We have made the following video to explain how message-queue-based async communication works:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=8moVErQn6LM">https://www.youtube.com/watch?v=8moVErQn6LM</a></div>
<p> </p>
<p>Please subscribe to our newsletter to stay updated with every new tutorial or major video we make.</p>
<p>If you need us to make a video about any specific system design problem do write us an email.</p>
]]></content:encoded></item><item><title><![CDATA[System Design : Feature Flag System]]></title><description><![CDATA[https://www.youtube.com/watch?v=Uoxje_Fnb0I
 
https://www.youtube.com/watch?v=nMQXXN8U7F0]]></description><link>https://blog.principle-ai.com/system-design-feature-flag-system</link><guid isPermaLink="true">https://blog.principle-ai.com/system-design-feature-flag-system</guid><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Tanvi Nadkarni]]></dc:creator><pubDate>Sat, 30 Nov 2024 18:07:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733610540898/f1e3a352-c1d0-466b-818b-14d8b5c25b1b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=Uoxje_Fnb0I">https://www.youtube.com/watch?v=Uoxje_Fnb0I</a></div>
<p> </p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=nMQXXN8U7F0">https://www.youtube.com/watch?v=nMQXXN8U7F0</a></div>
]]></content:encoded></item></channel></rss>