Effective System Design for Multi-Party Integration with Composable Components

How and why I decided to refactor one of the crucial parts of the logistics service at Tokopedia



Featured on Hashnode

Disclaimer: this article is written on my own behalf, not on behalf of the company I work for. Because of that, I will not share the detailed business impact results, which refer to confidential data, or the overall architecture inside Tokopedia. My purpose in writing this article is to share what I did and how I did it, the way I think it should be done. This article is opinionated and may not align with everyone; that's okay, because I just want to share the values and principles I hold in software engineering, and in problem solving in general. The context may differ from yours in the problem, the app's scale, and the company's size, but I believe the experience I describe in this article is worth reading.

Before going into the details: Tokopedia is one of the largest e-commerce companies in Indonesia, serving more than 100 million users per month. I am one of the people in charge of the business process in my service, which handles all of the shipping fee and free shipping (known as Bebas Ongkir in Indonesia) courier allocation in the Tokopedia system. The shipping fee is used to charge the buyer when they order a package, and is then handed over for settlement with the respective logistics partners.

Existing Condition

The service we maintain here has been live for approximately 8 years and is written in Go. There is nothing wrong with the software's age itself; the problem comes from the features that have been introduced day by day over the years, which are now rooted so deep inside the code that they are very hard to remove. These features are deeply entangled with the current business logic, even though some of them have been turned off for a couple of years.

As the e-commerce industry evolved, the logistics partners also started adding features to improve their pricing, to stay relevant and competitive with the other logistics providers. This directly affects the shipping fee calculation, because we store the partner pricing in our cache and calculate fees on the fly to reduce API calls and network hops to the logistics partners' systems. This was handled nicely in the current shipping fee service, but the problem is that the cache handling keeps getting more complex: there are so many ifs inside the method, whether for migrating between different schemes, different inputs, different handling for some regions, and many more. Other than that, in logistics we also have multiple kinds of package handling that depend on the shipping fee, such as insurance or the cash-on-delivery payment method. Let me show you a picture to visualize the shipping fee service's responsibility in Tokopedia.

User interface served by my service.

Before calculating the shipping fee, we do multiple validations, such as checking weight, distance, routes, and many feature-related things. After that, we check whether a cache entry for the request exists; if it doesn't, we send the request to the partner and save the result to the cache once the request finishes. Then we apply text manipulation, formatting, and package handling modifiers, as shown in the app screenshot above.

On top of all that, the current technical solution inside the shipping fee service uses functional programming. I marked the things that need to be decoupled by refactoring with a star sign, as in the image shown below.

Existing condition inside the service.

Why are all of those functions marked as things that need to be decoupled? Am I anti functional programming? A big no. Functional programming is great for its simplicity, but functional programming also prefers things that are composable and can be simplified using commonly known patterns. It means that whether you use object-oriented programming or functional programming, both highly encourage that a function is not only meant to solve one problem; if possible, the function should be generic enough to solve other problems easily.

Why Keep Refactoring?

As I described above, it is hard to refactor, so why keep pushing even though the code still works correctly? Yes, it works, but sometimes we see issues that we can't trace to their source (hard to debug). That reduces the service's reliability. There are also signs that our service has slowed down a bit because of unnecessary repeated validation and turned-off features that are highly coupled (which can also be the culprit behind some issues). Some of the leads suggested that we start the service fresh; it would absolutely be faster. That is a valid argument, but I think it would just become another story like the one we have been living through these years. I started to think that if we start fresh, the same thing will happen again in the future. So I proposed my architecture design to make the service more maintainable and efficient while still keeping it fast enough. That is the idea behind this initiative.

The Thought Process

The principle I always follow when designing software is to divide a complex problem into three phases: pre-processing (input), processing (where the actual core business logic lives), and post-processing (output). By dividing the phases, we can easily identify what we need to change when there is a feature update: the input, the process, or the output.
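The three phases can be sketched as a tiny pipeline. This is a minimal illustration, not the actual service code; the `Request` and `Quote` types and the fee formula here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Request and Quote are hypothetical types, simplified for illustration.
type Request struct {
	WeightGram int
	Origin     string
	Dest       string
}

type Quote struct {
	Fee int64
}

// preProcess validates and normalizes the incoming request (input phase).
func preProcess(req Request) (Request, error) {
	if req.WeightGram <= 0 {
		return req, errors.New("invalid weight")
	}
	return req, nil
}

// process runs the core business logic (the shipping fee calculation).
func process(req Request) Quote {
	// Flat rate per 100 grams, purely illustrative.
	return Quote{Fee: int64(req.WeightGram) * 10}
}

// postProcess shapes the result into the response the caller expects (output phase).
func postProcess(q Quote) string {
	return fmt.Sprintf("Rp%d", q.Fee)
}

func main() {
	req, err := preProcess(Request{WeightGram: 1200, Origin: "Jakarta", Dest: "Bandung"})
	if err != nil {
		panic(err)
	}
	fmt.Println(postProcess(process(req))) // Rp12000
}
```

Each phase can now change independently: a new validation touches only `preProcess`, a new output format touches only `postProcess`.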

The key principle I always use in any problem solving is to decompose things into pieces as small as possible. The keyword "small" here does not imply that a function should contain only one piece of logic, but that it should have one responsibility. It can have more than one piece of logic, but as long as it handles one responsibility, I think that is enough to break down the complexity. One might say that ~30 lines of code is the sweet spot for what a single method should handle. I think that is a common mistake: code responsibility can't be measured in lines of code. As long as the function is clear, you are good to go. By keeping everything small, we can easily replace and remove features that are no longer needed. Also, we don't have to hold so much mental state in our heads while debugging the software.

The principle above is somewhat abstract and difficult to grasp without a concrete example. That's okay. As you solve more problems, you will get a sense of "smaller is simpler". Your gut and your intuition will teach you as the days go by.

Here is the grand design to solve the problem above. It is a bit different from the actual design implemented inside Tokopedia, but that is fine; we can still learn from it.

Preprocessing the Request

Let's start with pre-processing the request. This happens when the incoming request is received: does the payload need to be overridden, or kept as is? There are certain cases where we need to override the request payload because of features. This phase is also where request validation is done, allowing us to do generic validation such as payload data type validation, weight, distances, features, and many other things that could break the shipping and order fulfillment process. It is a really simple process, but it involves many things to validate.

System Design of the validation logic

The image above shows the flow within the validation process. It is really simple, so we only need procedural and functional calls, defining functions for those simple tasks. We start with a generic validator, then call validation A, override the payload with A's result, then do validation B, and so on.
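The chain of validators and payload overrides above could be sketched like this. All names and rules here are hypothetical; the point is that each step has one responsibility and can be added or removed independently.

```go
package main

import (
	"errors"
	"fmt"
)

// Payload is a hypothetical request payload, simplified for illustration.
type Payload struct {
	WeightGram int
	Distance   int
	Override   bool
}

// A ValidateFunc both validates and, when needed, overrides the payload,
// mirroring the "validate then override" steps in the flow above.
type ValidateFunc func(Payload) (Payload, error)

func validateWeight(p Payload) (Payload, error) {
	if p.WeightGram <= 0 || p.WeightGram > 50000 {
		return p, errors.New("weight out of range")
	}
	return p, nil
}

func overrideLongDistance(p Payload) (Payload, error) {
	// Hypothetical feature: long-distance requests get a flag overridden.
	if p.Distance > 1000 {
		p.Override = true
	}
	return p, nil
}

// runValidations chains the validators in order, stopping at the first error.
func runValidations(p Payload, steps ...ValidateFunc) (Payload, error) {
	var err error
	for _, step := range steps {
		if p, err = step(p); err != nil {
			return p, err
		}
	}
	return p, nil
}

func main() {
	p, err := runValidations(
		Payload{WeightGram: 1500, Distance: 1200},
		validateWeight,
		overrideLongDistance,
	)
	fmt.Println(p.Override, err) // true <nil>
}
```

Removing a turned-off feature is then a one-line change: drop its step from the chain.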

Process the Request

Continuing to where the core business logic lies: the shipping fee calculation itself. There are a bunch of ways to calculate it, and not every region can use the same calculation. Based on that, I concluded that requests can be handled with several different operations depending on the buyer who wants to place an order, the seller who will fulfill the order, the logistics partners, and even the service level used to ship the package from the seller to the buyer. There are rules that need to be satisfied before deciding who is going to do what, and where. Based on those conditions, we can narrow it down to the requirements below.

  1. The system can decide where it should look up the shipping fee (the data source);

  2. The system can handle different cases of calculation and caching mechanisms; and

  3. The system produces the same exact output even for different calculations and data sources.

The three requirements above represent things that already exist in payment gateway systems. I was highly inspired by the agnostic payment gateway library built by the community in PHP, Omnipay. That framework takes an agnostic approach to supporting different payment gateways and lets users decide how to use the multiple payment gateways registered in the system. This concept helps us solve requirement number two.

But Omnipay doesn't have the capability to decide what goes where. In the payment gateway context, the user is the one who decides which payment gateway the payment will go through. So I needed to build something that can decide which gateway a request should go to. I was also inspired by how network routers work, so I came up with the idea of implementing static routing and dynamic routing concepts. With that, we can solve requirement number one.

Requirement number three is the simplest of them all. All of the gateways should produce the exact same output, which we get by defining an interface and using polymorphism, since Go has the capability to define interfaces.
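A minimal sketch of that polymorphism in Go might look like this. `GatewayInterface`, the types, and the fee numbers are illustrative assumptions, not the real Tokopedia interface; what matters is that every gateway returns the same output type.

```go
package main

import "fmt"

// FeeRequest and FeeResult are hypothetical types for illustration.
type FeeRequest struct {
	WeightGram int
}

type FeeResult struct {
	Source string
	Fee    int64
}

// GatewayInterface guarantees that every gateway produces the exact same
// output type, regardless of its data source or calculation method.
type GatewayInterface interface {
	GetFee(req FeeRequest) (FeeResult, error)
}

// CacheGateway calculates fees from locally cached partner pricing.
type CacheGateway struct{ RatePerKg int64 }

func (g CacheGateway) GetFee(req FeeRequest) (FeeResult, error) {
	fee := int64(req.WeightGram) * g.RatePerKg / 1000
	return FeeResult{Source: "cache", Fee: fee}, nil
}

// PartnerGateway would call the logistics partner's API; stubbed here.
type PartnerGateway struct{}

func (PartnerGateway) GetFee(req FeeRequest) (FeeResult, error) {
	return FeeResult{Source: "partner", Fee: 9000}, nil
}

func main() {
	// Both gateways satisfy the same interface, so callers treat them uniformly.
	gateways := []GatewayInterface{CacheGateway{RatePerKg: 10000}, PartnerGateway{}}
	for _, g := range gateways {
		res, _ := g.GetFee(FeeRequest{WeightGram: 1500})
		fmt.Println(res.Source, res.Fee)
	}
}
```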

Let's look at the image below: the final architecture concept we need to solve this problem.

System Design of the core process of shipping fee calculation

In the image shown above, there are shapes in three different colors, as defined in the legend: blue for the polymorphic approach, light purple for functional, and light green for a separate struct that simply organizes our code structure.

Starting from the top, we have the routers, which have the capability to apply routing rules, either static or dynamic. A static rule means that if the request is A, we go to gateway A. The dynamic rules handle all of the more complex routing, such as migrations, experiments, feature rollouts, or region whitelists, as long as the rule implements the DynamicRuleInterface interface. Both kinds of rules are preloaded when the app starts, to reduce per-request overhead. The router ends up knowing whether the payload should go to Gateway A, Gateway B, or Gateway C, and simply forwards the request to its destination gateways, concurrently.
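A simplified sketch of the static plus dynamic routing idea. `DynamicRuleInterface` is the name used above; everything else here (the rule, the payload fields, the gateway names) is a hypothetical stand-in.

```go
package main

import "fmt"

// Request is a hypothetical routed payload, simplified for illustration.
type Request struct {
	Courier string
	Region  string
}

// DynamicRuleInterface lets complex rules (rollouts, whitelists, migrations)
// plug into the router without changing the router itself.
type DynamicRuleInterface interface {
	Match(req Request) (gateway string, ok bool)
}

// RegionWhitelistRule is a hypothetical dynamic rule: whitelisted regions
// are routed to a newer gateway.
type RegionWhitelistRule struct {
	Regions map[string]bool
	Target  string
}

func (r RegionWhitelistRule) Match(req Request) (string, bool) {
	if r.Regions[req.Region] {
		return r.Target, true
	}
	return "", false
}

// Router holds static rules (exact match) and dynamic rules (behavioral),
// both preloaded at app start.
type Router struct {
	Static  map[string]string // courier -> gateway
	Dynamic []DynamicRuleInterface
}

func (rt Router) Route(req Request) string {
	// Dynamic rules run first so a rollout can override the static mapping.
	for _, rule := range rt.Dynamic {
		if gw, ok := rule.Match(req); ok {
			return gw
		}
	}
	if gw, ok := rt.Static[req.Courier]; ok {
		return gw
	}
	return "default-gateway"
}

func main() {
	rt := Router{
		Static: map[string]string{"courierA": "gatewayA"},
		Dynamic: []DynamicRuleInterface{
			RegionWhitelistRule{Regions: map[string]bool{"jakarta": true}, Target: "gatewayB"},
		},
	}
	fmt.Println(rt.Route(Request{Courier: "courierA", Region: "bali"}))    // gatewayA
	fmt.Println(rt.Route(Request{Courier: "courierA", Region: "jakarta"})) // gatewayB
}
```

When a rollout finishes, its dynamic rule can simply be deleted and replaced with a static entry, which is exactly the simplification described later in this section.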

In the image, gateways are represented in blue, meaning they need to comply with the GatewayInterface. Just as with the payment gateways, the client should define all of the GatewayInterface methods and implement them separately. This is the polymorphic design, which doesn't really change from the existing condition. This time, though, the gateway also has to know where to look and how to calculate the shipping fee. Since we don't want to add more complexity to the gateway directly, I decided to break it down into smaller pieces by separating the calculation logic and the cache layer handling into different structs. Remember the philosophy: keep everything small. In these structs we can fire the cache save asynchronously to make the software faster, because this is a read service that doesn't need transaction-level consistency. It is a different story with payments, where we have to wait until the data is successfully written. Concurrently, we can calculate the shipping fee results, return the response to the router, and combine them as a whole.
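Here is one way the split between the calculation struct, the cache struct, and the asynchronous cache save could look. This is a sketch under assumed names, with an in-memory map standing in for the real cache; the `WaitGroup` exists only so the demo can observe the async write.

```go
package main

import (
	"fmt"
	"sync"
)

// calculator holds only the fee calculation logic (one responsibility).
type calculator struct{ ratePerKg int64 }

func (c calculator) calc(weightGram int) int64 {
	return int64(weightGram) * c.ratePerKg / 1000
}

// cacheLayer holds only the cache handling (another responsibility).
type cacheLayer struct {
	mu    sync.Mutex
	store map[string]int64
}

func (cl *cacheLayer) get(key string) (int64, bool) {
	cl.mu.Lock()
	defer cl.mu.Unlock()
	fee, ok := cl.store[key]
	return fee, ok
}

func (cl *cacheLayer) set(key string, fee int64) {
	cl.mu.Lock()
	defer cl.mu.Unlock()
	cl.store[key] = fee
}

// gateway composes the two small structs instead of doing everything itself.
type gateway struct {
	calc  calculator
	cache *cacheLayer
	wg    sync.WaitGroup // lets this example wait for the async save
}

func (g *gateway) GetFee(key string, weightGram int) int64 {
	if fee, ok := g.cache.get(key); ok {
		return fee
	}
	fee := g.calc.calc(weightGram)
	// Fire the cache save asynchronously: this is a read service, so the
	// response doesn't need to block on the write finishing.
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		g.cache.set(key, fee)
	}()
	return fee
}

func main() {
	g := &gateway{
		calc:  calculator{ratePerKg: 10000},
		cache: &cacheLayer{store: map[string]int64{}},
	}
	fmt.Println(g.GetFee("jkt-bdg-1500", 1500)) // cache miss: calculated
	g.wg.Wait()                                 // demo only: wait for the async save
	fmt.Println(g.GetFee("jkt-bdg-1500", 1500)) // cache hit
}
```

Because the calculator and the cache are separate structs, either can be replaced (a new pricing scheme, a different cache backend) without touching the other.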

After all of the gateways successfully return their results, the router receives and combines them into a single slice of data, and its job is done.

The approach I chose significantly reduces complexity, because we can remove dynamic routing rules and move to static ones once we no longer need rollout- or whitelist-style features. We can also split a gateway apart if something more complex comes along; we can't predict the future, but there is room spared for it. That is the point of decoupling and breaking the software down into several components. Again, keep everything small.

Present the Output

After we receive the results, the structs and format are different from the legacy system. First, we need to introduce a simple layer that translates the structs into the desired outcome, in this case the legacy response. Simple and easy.

Second, we need to separate the component handling, as shown in the first UI screenshot. Defining a Decorator interface to modify the response is all we need. We can apply discounts, price aggregation, labeling, or extra wording when the conditions apply. Let's see the diagram below.

Presenter system design as an adapter or decorator.

In the presenter, we can have as many structs as we need to modify, manipulate, and map the data. In this case there are two: one responsible for maintaining the legacy flow, and a newer one that implements a proper decorator, the decorator pattern I mentioned above. This can easily make the program run faster in several cases. Let's say we have multiple APIs that serve different purposes.

  1. API to render the UI to user

  2. API to calculate something and get the results to be used in another service

  3. API to serve the original shipping fee without any component handling

By leveraging the decorator pattern, we can implement the three different APIs above with different strategies, in a resource-efficient manner.

  1. In the first API, since we need to render the available information to the real user, we need to use many components such as:

    1. Labeling;

    2. Text manipulation (price aggregation);

    3. Applying discounts;

    4. Rendering UI components (e.g. set disabled and show errors); and

    5. Additional package handling information.

  2. In the second API, the calculation can involve additional package handling information to be used by another service, such as:

    1. Applying discounts; and

    2. Calculating insurances for package handling

Of course, we don't need the other things, such as labeling, text manipulation, and rendering UI components, which are not needed in this API. Boom: compute resource savings.

  3. The third API is the simplest one: just do the shipping fee calculation and it's done. We can create an adapter to map to the desired response.
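The per-API decorator composition could be sketched like this. The decorators and fee values are hypothetical; the point is that each API assembles only the decorators it needs.

```go
package main

import "fmt"

// Response is a hypothetical shipping fee response, simplified for illustration.
type Response struct {
	Fee   int64
	Label string
}

// Decorator modifies a response; each implementation has one responsibility.
type Decorator interface {
	Decorate(Response) Response
}

// DiscountDecorator applies a flat discount, clamped at zero.
type DiscountDecorator struct{ Amount int64 }

func (d DiscountDecorator) Decorate(r Response) Response {
	r.Fee -= d.Amount
	if r.Fee < 0 {
		r.Fee = 0
	}
	return r
}

// LabelDecorator adds a free-shipping label when the fee hits zero.
type LabelDecorator struct{}

func (LabelDecorator) Decorate(r Response) Response {
	if r.Fee == 0 {
		r.Label = "Bebas Ongkir"
	}
	return r
}

// apply runs a decorator chain; each API picks only the decorators it needs.
func apply(r Response, decorators ...Decorator) Response {
	for _, d := range decorators {
		r = d.Decorate(r)
	}
	return r
}

func main() {
	base := Response{Fee: 10000}

	// UI-facing API: discounts plus labeling.
	ui := apply(base, DiscountDecorator{Amount: 10000}, LabelDecorator{})
	fmt.Println(ui.Fee, ui.Label) // 0 Bebas Ongkir

	// Raw shipping fee API: no decorators at all.
	raw := apply(base)
	fmt.Println(raw.Fee) // 10000
}
```

The calculation-only API would simply pass a different decorator list (discounts and insurance, no labeling), which is exactly the compute saving described above.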

Some Caveats from the Internet Seniors

There are some caveats from people who prefer concrete types and avoid abstraction. That's okay; the decision is theirs, and there are pitfalls if either approach is done wrong. I think both decisions can lead to mistakes if the option isn't chosen deliberately, whether you use abstraction or not. A decision is a decision; if it turns out to be a mistake, so be it. Decisions are temporary, and we can still change them and learn from them. Do the things that matter. Besides that, here are some arguments for why I keep breaking things down and using the polymorphic approach to this day.

Using an Interface (Dynamic Dispatch) Is Slower than a Concrete Type (Static Dispatch)

There is plenty of input and proof that interface method calls are slower than concrete type calls. Using an interface means the compiler cannot resolve the call target at compile time, so it must be decided at runtime. Because the method is only resolved at runtime, a call through a concrete type, which is resolved at compile time, is expected to be faster.
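A tiny illustration of the difference, with hypothetical types. Through the concrete type, the compiler knows the exact method at compile time and can inline it; through the interface, the method is looked up at runtime, which prevents inlining and adds a small per-call cost.

```go
package main

import "fmt"

type Adder interface{ Add(a, b int) int }

type ConcreteAdder struct{}

func (ConcreteAdder) Add(a, b int) int { return a + b }

func main() {
	c := ConcreteAdder{}

	// Static dispatch: the target is known at compile time and inlinable.
	fmt.Println(c.Add(1, 2)) // 3

	// Dynamic dispatch: the target is resolved through the interface's
	// method table at runtime.
	var i Adder = c
	fmt.Println(i.Add(1, 2)) // 3
}
```

Both calls return the same result; the difference is only where the method lookup happens, which is why the cost is usually measured in nanoseconds per call.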


To me, personally, it doesn't matter, as long as the value of decoupling the software is much higher than the performance drawback. The performance drawback is not that high, actually; the code still runs fast enough, as I will show at the end of the article. I am the kind of person who prefers organizing code with interfaces and structs rather than splitting the app into microservices. Splitting adds far bigger performance drawbacks: network failures, unknown errors, and other uncertainties. Service separation is really not worth it when it only adds unneeded complexity. If the engineering team and the traffic are not at hyperscale, then to me it is not worth doing.

Abstraction Can Make the Code Worse

There is so much content on YouTube, Stack Overflow, Twitter, and elsewhere from people who hate the use of abstraction. I also hate abstraction when it comes to inheritance, because it can make our code much uglier just to avoid repetition. This is well explained in this YouTube video.



Abstraction through inheritance is awful if not well managed. Instead, use a polymorphic approach (defining interface behavior), which can simplify, isolate, and decouple things much more cleanly. It is okay to have some redundant and repeated code.


That is the strategy I used to redesign and refactor the system I am in charge of. Thankfully, I implemented the design together with incredible, clever colleagues who gave me advice along the way. Development was completed in 2023, and the rollout to all endpoints will be finished in 2024. This initiative has shown great metrics, and the effort has paid off. In the middle of the rollout, we can already see that our service can handle almost +290% more throughput, with -30% average latency and -500% max latency, after implementing the system design changes.

Simplifying things doesn't just make our code more organized. Because everything is clearer and more concise, we can manage the code and remove unused modules so they don't burden the system's performance. That's all. Keep everything small to maintain simplicity.