Tuning Cipher to 1M TPS for ease of scaling of online authentications (Part 2)
We went through the overall idea behind the demo in Part 1. Part 2 will inform us about how we achieved the feat, highlighting the technologies that we used along the way. Heads up, it is going to be engineering focussed.
Quick recap: Cipher successfully handled a chart-busting 1 Million Transactions Per Second (TPS) of online authentication requests on AWS cloud infrastructure costing a prudent $200/hour. The simulated load is 240x higher than the combined authentication volume of all the big players in India during online payment transactions.
Things we engineered in a nutshell
1. Optimized systems for a minimum unit for processing a desirable number of online authentications
- Built the infrastructure
- Set up the simulation environment and node distribution mechanisms
- Processed successful authentications for the transactions
2. Scaled the minimum unit for the desired load capacity of 1 Million TPS
- Nginx — A web server used as an ingress controller (to accept HTTP requests).
- EKS (Elastic Kubernetes Service) — An AWS-managed service to run Kubernetes.
- Prometheus — An open-source monitoring system with a dimensional data model, flexible query language, efficient time-series database, and modern alerting approach.
- Grafana — An open-source analytics & monitoring solution for databases and microservices.
- Metrics server — A cluster-wide aggregator of resource usage data for auto-scaling horizontal pods.
- We’ve exposed JVM stats, HTTP request stats, and connection stats from each of these apps via a metrics endpoint and made it a scrape target in Prometheus.
- We’ve also enabled the JMX port on each of these microservices to monitor them in real-time.
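To make the "metrics endpoint" idea concrete, here is a minimal sketch of the plain-text format that Prometheus scrapes. It is not taken from the Cipher services (which are JVM apps); the function name, metric name, and label values are illustrative assumptions.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs inside curly braces.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and label values, for illustration only.
body = render_gauge(
    "jvm_memory_used_bytes",
    "Used JVM memory in bytes.",
    [({"service": "edith", "area": "heap"}, 1024)],
)
```

A service would return a body like this from its metrics endpoint, and Prometheus would pull it at each scrape interval.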
A ‘node’ is a single server on which multiple microservices run and are monitored to process a definite number of authentication requests. We configured each of these nodes at 64GB RAM & 16 GB processor and deployed them on Kubernetes.
Each node had several pods running within it, which hosted the microservices required for authenticating online transactions. Amongst the microservices, there are three notable ones.
- Mastercard Connector — To handle the ACS protocol specific nuances and serve as a connector for the Mastercard card scheme
- Edith — To orchestrate the intricate authentication plans
- Cerberus — To serve as the Identity Provider and the core authentication engine
Although processing 1 Million TPS was the objective, we first had to generate that much transaction load. So, we used Gatling as the load generator. We spread the Gatling Master across four zones in Mumbai and Singapore, running 50 tasks of individual test scripts to authenticate bank transactions coming in from card networks.
The Gatling injector simulated the interactions that a card network usually has with its ACS provider while authenticating online transactions. It also reproduced the cardholder interactions required for authentication. The system-level simulations are:
- VEReqs on behalf of the card network
- PAReqs on behalf of the card network
- Challenge-response submissions as performed by cardholders to prove their authenticity
The simulator routed these requests to auto-scalable clusters of Cipher microservices, which were configured with dummy BINs.
These are the steps for a successful authentication flow.
- The simulator verifies (with VEReq) whether the card under authentication is enrolled with Cipher. If it is, the simulator initiates the payer authentication request (PAReq) and gets the authentication URL from Cipher.
- Gatling simulates the Swipe to Pay (S2P) interaction, generating the necessary credentials required for completing the authentication.
- Gatling submits the S2P challenge and receives the authentication response.
- It redirects the control back to the card network module that’s simulated.
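The steps above can be sketched in a few lines of Python. This is only an illustration of the request sequence, not the Gatling script or Cipher's real API: the `FakeCipher` class, its methods, the URL, the status strings, and the `s2p:` credential format are all hypothetical stand-ins.

```python
import uuid


class FakeCipher:
    """Hypothetical in-memory stand-in for Cipher (not the real API)."""

    def __init__(self, enrolled_cards):
        self.enrolled = set(enrolled_cards)
        self.sessions = {}

    def vereq(self, card):
        # Enrollment check: is this card registered with the ACS?
        return card in self.enrolled

    def pareq(self, card):
        # Open an authentication session and hand back its URL.
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = card
        return f"https://acs.example.invalid/auth/{session_id}"

    def submit_challenge(self, auth_url, credential):
        # Validate the challenge response against the open session.
        session_id = auth_url.rsplit("/", 1)[-1]
        card = self.sessions.pop(session_id, None)
        ok = card is not None and credential == f"s2p:{card}"
        return "AUTHENTICATED" if ok else "FAILED"


def authenticate(cipher, card):
    """Run one simulated authentication; returns (status, requests_sent)."""
    requests_sent = 1
    if not cipher.vereq(card):
        return "NOT_ENROLLED", requests_sent
    auth_url = cipher.pareq(card)
    requests_sent += 1
    credential = f"s2p:{card}"  # stand-in for the Swipe to Pay credential
    status = cipher.submit_challenge(auth_url, credential)
    requests_sent += 1
    # The status is what gets handed back to the simulated card network.
    return status, requests_sent


cipher = FakeCipher(["4000000000000001"])
result = authenticate(cipher, "4000000000000001")
```

In this sketch a successful run sends three requests (VEReq, PAReq, and the challenge submission) for one authentication.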
A minimum unit consisted of a specific number of instances of every microservice. The function of this minimum unit was to handle a definite number of transactions.
To come up with this unit, we tested each microservice individually to measure the number of authentication requests it could serve in a given time period with specific resources allocated to it. The minimum unit that we came up with successfully handled 20K authentication requests per second.
1 minimum unit = 3 units of Edith, 2 units of Mastercard Connector, 2 units of Cerberus
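As a rough illustration of how such a unit can be sized, the sketch below divides the target load by each service's measured per-instance throughput and rounds up. The per-instance figures are hypothetical, picked only so that the result matches the 3/2/2 split above; the real measurements aren't published here.

```python
import math

TARGET_TPS = 20_000  # load one minimum unit must handle

# Hypothetical per-instance throughputs (requests/second), for illustration only.
PER_INSTANCE_TPS = {
    "Edith": 7_000,
    "Mastercard Connector": 10_000,
    "Cerberus": 12_000,
}


def size_minimum_unit(target_tps, per_instance_tps):
    """Return the instance count each service needs to reach target_tps."""
    return {
        name: math.ceil(target_tps / tps)
        for name, tps in per_instance_tps.items()
    }


minimum_unit = size_minimum_unit(TARGET_TPS, PER_INSTANCE_TPS)
```

Rounding up leaves some headroom in every service, which fits the goal of a unit that serves its load comfortably rather than at the edge.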
Here is the resource utilization data when Cipher was comfortably serving 20K authentications.
Scaling of the minimum unit
Now that we had a stable unit that could serve 20K TPS comfortably, we scaled it to accommodate additional authentication requests. We used this unit as the base reference for serving transaction requests from one issuer.
With this logic in place, we scaled out to 50 such units, handling 20K TPS each, to serve 50 issuers, adding up to 1 Million TPS. The 4 Gatling servers simulated the load for 500 distinct BINs spread across 50 issuers, each having 100,000 cards. The entire setup, consisting of 4 Kubernetes clusters, spanned 2 AWS regions with two availability zones each.
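The arithmetic behind this scale-out is easy to check. The sketch below uses only the figures quoted in this post; the variable names are ours, and the pod total assumes that only the three notable microservices are counted.

```python
MINIMUM_UNIT = {"Edith": 3, "Mastercard Connector": 2, "Cerberus": 2}
UNIT_TPS = 20_000  # TPS served by one minimum unit
UNITS = 50         # one unit per issuer
BINS = 500
ISSUERS = 50

total_tps = UNITS * UNIT_TPS                 # 1,000,000
pods_per_unit = sum(MINIMUM_UNIT.values())   # 7
total_pods = UNITS * pods_per_unit           # 350
pods_by_service = {name: n * UNITS for name, n in MINIMUM_UNIT.items()}
bins_per_issuer = BINS // ISSUERS            # 10
```

Under that assumption, the 350-pod total lines up with the pod count used for the 1 Million TPS run.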
On AWS, new nodes can’t be brought up within minutes. Since we needed that kind of responsiveness, we warmed up the nodes ahead of time by scaling up the pods. We used 350 pods across 100 nodes to handle the heavy load of 1 Million authentication requests. When the load was lower, we scaled the nodes down by cutting back the pods based on resource utilization, using the Horizontal Pod Autoscaler.
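For readers unfamiliar with the Horizontal Pod Autoscaler, its core scaling rule is documented by Kubernetes as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with a default tolerance of 0.1 around the target. Here is a minimal Python sketch of that rule; the replica bounds and the utilization numbers are assumptions for illustration, not our actual autoscaler settings.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Apply the HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    # Within the tolerance band, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical CPU utilization readings (percent) against a 60% target.
scale_up = desired_replicas(3, current_metric=90, target_metric=60)
scale_down = desired_replicas(6, current_metric=30, target_metric=60)
steady = desired_replicas(4, current_metric=63, target_metric=60)
```

The same rule covers both directions: utilization above the target adds pods, and utilization below it removes them.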
- The authentication’s write I/O was done asynchronously considering the purpose of the live demonstration.
- Three requests effectively make one authentication.
If you’d like to check out the test data that we used for simulating authentications, it is available for reference here.
It took our team of 8 people only 9 days to successfully deliver the desired results. We’re glad to have made some key decisions during this process that allowed us to quickly achieve what we wanted — like choosing AWS, Kubernetes, and Gatling Enterprise. We hope to work on several such innovative initiatives that redefine the future of payments in our country and beyond. Thanks for reading and hope you found this helpful!