Charles Humble: Hello, and welcome to the InfoQ Podcast. I’m Charles Humble, one of the co-hosts of the show and editor in chief at cloud native consultancy firm Container Solutions. My guest today is Lucas Cavalcanti. Lucas is a Principal Engineer at Nubank, which is the leading FinTech in Latin America and has become the most valuable digital bank in the world. He gave a talk at QCON Plus in May of this year called Architecting Software for Leverage. In it, he describes how Nubank’s software, team structure, and processes evolved alongside the various stages of the company, from startup in 2013 through growth, consolidation, and expansion. I was really keen to follow up with him to talk about Nubank’s programming language choices and their architectural choices, because I found the talk absolutely riveting.
As I mentioned, the starting point for this podcast was a QCON Plus talk, and the next edition of QCON Plus is coming up this November, online and in-person. As well as the incredible online experience taking place November the first through fifth, we’ll have optional in-person events the following week in New York and San Francisco. Places will be limited. QCON Plus features 12 tracks curated by domain experts, to focus on the topics that matter right now in software. Tracks covered this November include the cloud operating model, microservice patterns and anti-patterns, from remote to hybrid teams, and architectures you’ve always wondered about. At QCON Plus you can find out what should be on your radar from the world’s most innovative software leaders, driving change and innovation. Each will share with you their solutions, best practices, and actionable insights to reduce uncertainty on which technologies should be part of your roadmap. We hope to see you online or in-person this November. You can visit qcon.plus for more information. And with that, Lucas, welcome to the InfoQ Podcast.
Lucas Cavalcanti: Thank you, Charles. Very glad to be here.
Can you describe that initial technology stack with language, databases, messaging infrastructure, and so on that you started to build on at Nubank? [02:11]
Charles Humble: In your QCON Plus talk, you mentioned that during Nubank’s startup phase, an important driver was time to market. And then you listed some of your initial technology choices, which in truth I found quite surprising. So could you maybe describe that initial technology stack, with language, databases, messaging infrastructure, and so on, that you started to build on?
Lucas Cavalcanti: We were aiming to build a credit card product, or eventually a bank. And in this financial domain, we started by making some assumptions, right? The first assumption was that for banking we need to have the whole history, the whole audit trail of everything that we do. So the very first choice was actually the database, the Datomic database, which was very niche. It is still very niche at this time. But it is a database that doesn’t lose data on updates. It keeps all the updates that happened to the data in the transaction log. So instead of storing just the data, it stores the changes to the data, like a Git repository. It works like Git: it stores the patches that you apply to the entities, instead of just replacing them.
And what you gain for free with Datomic is the whole history. That seemed like a good property to have in a FinTech, in a financial domain service. And Datomic is built in Clojure. So because of Datomic, we chose Clojure as the main programming language. We were going for a functional programming language either way. We probably would have gone with Scala or some other language if it wasn’t for Datomic being Clojure. These choices were mostly because we believed in functional programming, because of the simplicity that it brings, and also because of how well it connects with the financial domain: functional programming relates very closely to mathematics, and to finance in particular.
So those were the main choices. Other than that, we aimed to build everything in the cloud, so we chose AWS for that. And Kafka was a technology that was starting to get traction in 2013. It also had that property of not losing messages: it keeps a log of the messages that you can replay again. We chose Kafka because of that. So the core of the decisions we took was mostly about getting functional programming, immutability, and keeping data, or not losing data, even on updates. This was what mostly drove our decisions.
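To make the "stores the changes, like Git" idea concrete, here is a minimal sketch in Python. It is not Datomic's API or Nubank's code; the names `Change`, `ChangeLog`, `assert_value`, `current`, and `history` are hypothetical, chosen only to illustrate an append-only log from which both the current state and the full history can be read.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Change:
    tx: int          # transaction number, increasing over time
    entity: str      # e.g. "account-1"
    attribute: str   # e.g. "limit"
    value: Any       # the value asserted by this transaction


class ChangeLog:
    """Append-only store: an update adds a Change, nothing is overwritten."""

    def __init__(self) -> None:
        self._changes: List[Change] = []
        self._next_tx = 1

    def assert_value(self, entity: str, attribute: str, value: Any) -> int:
        """Record a new value and return the transaction number used."""
        tx = self._next_tx
        self._changes.append(Change(tx, entity, attribute, value))
        self._next_tx += 1
        return tx

    def current(self, entity: str, as_of: Optional[int] = None) -> Dict[str, Any]:
        """Rebuild the entity by replaying changes up to `as_of` (inclusive)."""
        state: Dict[str, Any] = {}
        for change in self._changes:  # changes are stored in tx order
            if as_of is not None and change.tx > as_of:
                break
            if change.entity == entity:
                state[change.attribute] = change.value
        return state

    def history(self, entity: str, attribute: str) -> List[Any]:
        """Every value the attribute has ever had, oldest first."""
        return [c.value for c in self._changes
                if c.entity == entity and c.attribute == attribute]


log = ChangeLog()
first_tx = log.assert_value("account-1", "limit", 1000)
log.assert_value("account-1", "limit", 2500)
```

Here `log.current("account-1")` reflects the latest limit, `log.current("account-1", as_of=first_tx)` reflects the earlier one, and `log.history("account-1", "limit")` is the audit trail. A real database would index the log rather than replay it on every read.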
You were using Clojure. Did you use a microservice framework alongside it, or were you just using the language on its own? [04:57]
Charles Humble: You said you were using Clojure. Did you use a microservice framework alongside it, or were you just using the language on its own?
Lucas Cavalcanti: In our initial Clojure stack we were, surprisingly, not using any framework or anything on top of the language. We were using a library for the HTTP server code, for handling HTTP requests and so on. But we didn’t feel the need for a framework like the ones we usually end up using in other languages, like Spring in Java or Rails in Ruby. In Clojure, just having an HTTP client library and an HTTP server library was enough for us, because they allow you to handle an HTTP request by receiving a Map and returning a Map. And that’s simple enough and powerful enough.
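The "receive a Map, return a Map" style can be sketched as follows. This is a Python illustration using plain dictionaries, not the Clojure library Nubank used; the map keys (`path_params`, `request_id`, `status`, `body`, `headers`) and the function names are assumptions made for the example.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Response = Dict[str, Any]
Handler = Callable[[Request], Response]


def get_account(request: Request) -> Response:
    """A handler is just a function from a request map to a response map."""
    account_id = request["path_params"]["id"]
    return {"status": 200, "body": {"id": account_id}}


def with_request_id(handler: Handler) -> Handler:
    """Cross-cutting behavior is a function that wraps one handler in another."""
    def wrapped(request: Request) -> Response:
        response = handler(request)
        headers = dict(response.get("headers", {}))
        headers["x-request-id"] = request.get("request_id", "unknown")
        # Build a new response map instead of mutating the handler's result.
        return {**response, "headers": headers}
    return wrapped


app = with_request_id(get_account)
response = app({"method": "GET", "path_params": {"id": "42"}, "request_id": "r-1"})
```

Because handlers are ordinary functions over plain data, they can be tested by calling them with a dictionary, with no server or framework running.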
Lucas Cavalcanti: The service code didn’t need any other library. The deployment code is a different story. Clojure runs on the JVM, so we can use whatever Java has. Initially we bundled the code in a WAR file and deployed it to a Jetty server on AWS, using an already baked image that contained the Jetty server, which you could just configure to pull the WAR file from S3. So this was our initial deployment strategy. At the very beginning we only had four or five services, so it wasn’t that sophisticated. But it was leveraging baked images on Amazon and CloudFormation to coordinate the deploys.
Charles Humble: Just sticking with Clojure. Had you used the language before, or was it new to everybody there?
Lucas Cavalcanti: I saw Clojure briefly in college. I had a functional programming class where I heard about Clojure. I saw something about LISP, and had classes about LISP. Clojure is based on LISP, so basically you get around the syntax by learning LISP, and then Clojure is just learning the standard API on top of LISP. But I had never worked with Clojure when I started at Nubank. I had already heard about the ideas behind Clojure, but I only really started using it when I joined Nubank. And that’s the case for 99% of the people who join Nubank.
Lucas Cavalcanti: Almost all of them have never worked with Clojure before. And Clojure is a language that’s surprisingly easy to onboard with, because it has very few language constructs: it is parentheses and function calls. There’s not much more than that, compared to a language like Scala, where you have 100 language features that you sometimes need to understand. In Clojure you have parentheses, you have LISP, and you only need something a little more complex when you go into the macro world. But you almost never need to use macros.
Charles Humble: Were there any resources, or books, or that sort of thing that you found helpful in learning a language, or did you literally just pick it up?
Lucas Cavalcanti: I already had a functional programming background. I think that is the thing that takes longer to understand or to learn. But if you already have a functional programming background, learning the Clojure syntax and constructs is something that you could do in a week. You don’t need to read tons of books or things like that. Some very good Clojure books started appearing after we started using Clojure, not before. And we made several mistakes in the beginning on code style and on code organization in Clojure, because we were basically applying conventions from the other languages we had used before, like Java or Ruby organization, which don’t fully apply to Clojure. The organization we eventually came up with was very efficient, and very different from what we get in other languages.
Did you hit any regulatory challenges as you were getting going with a new bank on public cloud in Brazil? [09:01]
Charles Humble: You mentioned as well, that you started on AWS on a public cloud infrastructure. And certainly in Europe, banking is extremely heavily regulated which can make using public cloud a bit challenging. So I was curious as to whether that was also true in Brazil, and whether you hit any regulatory challenges as you were getting going?
Lucas Cavalcanti: We were a little bit pioneers in being a financial institution on the cloud. We did have some regulation in Brazil, or rather a regulation with a gray area: whether you could store data about customers only in Brazil, or whether you could do it outside as well. But the Central Bank in Brazil was an institution that was aiming for more competition and more innovation at the time. So they were willing to understand the usage that we had in the cloud, and then start building new regulations and new rules for banks that were born in the cloud.
Just to give a quick picture of the landscape: Brazil has only five banks that own 80 to 90% of all the banking services in the country, or at least it used to be that way when we started. So the Central Bank was willing to understand new things and create new regulations. Being in the cloud from the beginning took some public policy work, talking to regulators. We were pioneers, and after us several other FinTechs popped up in Brazil because of those regulation updates. They didn’t happen just because of us, but we were pretty early in that movement.
What were you doing in terms of monitoring and observability at this very early stage of the company? [10:47]
Charles Humble: You were pioneering in quite a lot of ways. I think it’s fascinating. What were you doing in terms of monitoring and observability at this very early stage of the company?
Lucas Cavalcanti: Very early on, we started leveraging the Clojure ecosystem. There were a few tools in Clojure that allowed for real-time, transient monitoring at the time: checking service state, checking error rates over a short period. So it was not persistent. I’m forgetting the name of the tool, but it was a Clojure tool that integrated nicely with the library that we were using for the HTTP server at the time, which we still use.
So we started with this real-time but transient monitoring. We also started to manage a Kibana log cluster for monitoring, and that ended up being more challenging than we were hoping for. Eventually we moved to Splunk, which is our log aggregator service. That worked pretty well, and it’s working so far. At the scale we are at now, it took some work to get there with this kind of service. But the monitoring at the very beginning was real-time monitoring and a log aggregator. That was pretty much it.
You mentioned briefly that you were using Alistair Cockburn’s Hexagonal Architecture as inspiration. Could you maybe describe what that is, and how you applied it? [12:14]
Charles Humble: In your QCON Plus talk, you mentioned briefly that you were using Alistair Cockburn’s Hexagonal Architecture as inspiration, which is obviously more object oriented. But could you maybe describe what that is, and how you applied it?
Lucas Cavalcanti: One thing to mention about Cockburn’s Hexagonal Architecture is that it was born in the Java, object-oriented world. So just to give some context: what we use is not exactly that implementation, but it uses that idea as an inspiration. I think in Cockburn’s idea you have, say, a web server, and every operation on that web server is a port, which is an interface. Then the adapter is the actual implementation of that interface. The rest is about how to implement the classes behind that. In our implementation, we use the idea of separating the port, which is the communication with the external world, from the adapter, which is the code that translates that communication into actual code that you can execute. And then the controller is the piece that takes that communication from the external world and runs the actual business logic.
I think the Cockburn definition stops at the controller, and after the controller it’s already business logic. Since we are working in Clojure and functional programming, we found it useful to separate pure business logic from impure business logic, in the sense of having side effects. So our controller layer is for handling side effects. And we have another layer, which is the core business logic, made of just pure functions that receive Maps and return Maps, or something like that. That separation was very helpful for us. I do have a talk, which I can send the link to later, that explains that breakdown of the hexagonal architecture that we use and this separation.
Charles Humble: That would be great. And I’ll make sure I include a link to the talk in the show notes, when the podcast goes out on InfoQ. But it sounds like it’s hexagonal architecture inspired rather than strictly hexagonal architecture?
Lucas Cavalcanti: And we even gave it a different name, to avoid confusion between these ideas. We call it the diplomat architecture. It is inspired by ports and adapters, this hexagonal architecture, but it has a functional programming bias: every layer should be as functional as possible, receiving Maps and returning Maps when possible and not having any side effects, and the side effects are kept in particular layers, where they are more complex and you can treat them differently.
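The layering described above (port, adapter, controller, pure logic) can be sketched roughly like this. It is a Python illustration under stated assumptions, not Nubank's actual code: the purchase-approval rule, the wire field names (`accountId`, `amountCents`), the 422 status for a declined purchase, and all function names are hypothetical.

```python
from typing import Any, Callable, Dict

Account = Dict[str, Any]

# --- Pure logic: no I/O, just maps in and maps out ---------------------------
def apply_purchase(account: Account, purchase: Dict[str, Any]) -> Dict[str, Any]:
    new_balance = account["balance"] + purchase["amount"]
    approved = new_balance <= account["limit"]
    return {
        "approved": approved,
        "account": {**account, "balance": new_balance} if approved else account,
    }

# --- Adapter: translates the external wire format into internal maps ---------
def purchase_from_wire(payload: Dict[str, Any]) -> Dict[str, Any]:
    return {"account_id": payload["accountId"], "amount": int(payload["amountCents"])}

# --- Controller: the only layer here that performs side effects --------------
def handle_purchase(purchase: Dict[str, Any],
                    load: Callable[[str], Account],
                    save: Callable[[Account], None]) -> Dict[str, Any]:
    account = load(purchase["account_id"])      # side effect: read
    result = apply_purchase(account, purchase)  # pure decision
    if result["approved"]:
        save(result["account"])                 # side effect: write
    return result

# --- Port: the entry point the outside world talks to ------------------------
def http_purchase_port(payload: Dict[str, Any],
                       load: Callable[[str], Account],
                       save: Callable[[Account], None]) -> Dict[str, Any]:
    result = handle_purchase(purchase_from_wire(payload), load, save)
    status = 200 if result["approved"] else 422
    return {"status": status, "body": {"approved": result["approved"]}}


# An in-memory dictionary stands in for the database in this sketch.
store: Dict[str, Account] = {"acc-1": {"id": "acc-1", "balance": 0, "limit": 1000}}

def load_account(account_id: str) -> Account:
    return store[account_id]

def save_account(account: Account) -> None:
    store[account["id"]] = account

ok = http_purchase_port({"accountId": "acc-1", "amountCents": "300"},
                        load_account, save_account)
declined = http_purchase_port({"accountId": "acc-1", "amountCents": "5000"},
                              load_account, save_account)
```

The design point is that `apply_purchase` can be tested with plain dictionaries, while everything that touches the outside world is confined to the controller and the functions passed into it.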
This seems to be a set of architectural choices that are driven by wanting to have an immutable architecture. Is that right? [15:12]
Charles Humble: At its heart, this seems to be a set of architectural choices that are driven by wanting to have an immutable architecture. Which actually, interestingly enough is an idea I explored recently on the podcast with Matthew Perry. So I’ll link to that in the show notes, as well. But have I got that right? Is that what you’re trying to do?
Lucas Cavalcanti: We are operating in a very complex domain. The financial domain is very complex by itself. And we got those ideas from the Clojure community. Rich Hickey has a very good talk called “Simple Made Easy”, which is at the core of the Clojure language. The idea is that we should aim for simple constructs, because the problem is already complex enough, right? If you aim for simplicity, you don’t have to solve problems that you created yourself. You solve just the problem that you need to solve. And immutability is one thing that makes everything else simple, because you have fewer things to worry about when everything is immutable.
Lucas Cavalcanti: We use that idea in the code design, in the architecture of the code we have in the services, right? So these ports and adapters, the diplomat architecture, and keeping immutability in most places. But the infrastructure is also immutable. This was an idea that was going around at the time, but it was not very widely used. The idea is: okay, you’re running on the cloud, but your machines are immutable, in the sense that if you want to make any architectural change, or any deploy, or anything like that, you don’t change a machine that’s already running on the cloud. You always create a new one with the new configuration, or with the new code.
We use that immutable idea for the whole infrastructure as well. So every time we make a change of architecture, or change configurations, it’s always about creating a new version of the system with the new configuration or the new deploy. When that’s running, we kill the old one. That also allows us to reason about the infrastructure with less entropy, with fewer things that can go wrong. If the new configuration goes bad, the rollback is easy: you just kill the new one and keep the old one. You don’t have to keep track of what manual operations were run on the services. Every new operation creates new machines, and the old ones get decommissioned or deleted.
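The create-new, verify, then kill-old flow can be modeled in a few lines. This is a toy Python model for illustration only: `Stack`, `Environment`, and the `healthy` check are hypothetical names, and no real cloud API is being called.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stack:
    """One immutable version of the running system; it is never edited in place."""
    version: str
    image: str


class Environment:
    def __init__(self, live: Stack) -> None:
        self.live = live
        self.decommissioned: List[Stack] = []

    def deploy(self, candidate: Stack, healthy: Callable[[Stack], bool]) -> bool:
        """Bring up `candidate` beside the live stack and switch only if healthy."""
        if not healthy(candidate):
            # Rollback is just discarding the new stack; the old one was never touched.
            self.decommissioned.append(candidate)
            return False
        self.decommissioned.append(self.live)
        self.live = candidate
        return True


env = Environment(Stack("v1", "image-v1"))
good = env.deploy(Stack("v2", "image-v2"), healthy=lambda stack: True)
bad = env.deploy(Stack("v3", "image-v3"), healthy=lambda stack: False)
```

After these two deploys, `env.live` is still the `v2` stack: the failed `v3` rollout never modified anything that was serving traffic.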
How are you handling mutable aspects? [17:50]
Charles Humble: You obviously have some mutable aspects of the system, so how is that handled? In traditional hexagonal architecture that would be in the ports. Is it in the database? Do you have an abstraction layer? How are you handling mutable aspects?
Lucas Cavalcanti: Most of the mutable aspects of the system, the mutable state, belong in the database. But even the database that we use, Datomic, we also call an immutable database, in the sense that it has a log of the events that happen in it. When you get a database, you get a value, a snapshot of the data. So basically, when you get a database value, it stops time. You don’t get any new transactions on that value. It’s immutable in that sense. It’s like Git: when you’re using it, you do a checkout. If somebody else changes the upstream repository, it doesn’t affect you, right? Not until you do a merge.
Datomic uses the same idea. You get a checkout of your data that doesn’t change until you get the next version. So I think that’s how we handle mutability in an immutable way. The actual transactions get a little bit safer, because we don’t have that state changing in place. The data changes, the entities change, but the data you get, the Maps you get from the database, are immutable. One thing that we didn’t mention here is that in Clojure, by default, all the data structures are immutable. So when you’re writing your code, when you’re writing your business logic, everything is already immutable. You don’t have to think about it. You don’t have to do anything extra to get it. It is the language default. And that applies to the database as well.
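The "database as a value" behavior can be illustrated with a small Python sketch. It is not Datomic's API: `Connection`, `db`, and `transact` are hypothetical names, and copying the whole map on every write is a simplification (persistent data structures share unchanged parts instead).

```python
from types import MappingProxyType
from typing import Any, List, Mapping


class Connection:
    """Keeps a growing list of database values; no value is ever mutated."""

    def __init__(self) -> None:
        self._values: List[Mapping[str, Any]] = [MappingProxyType({})]

    def db(self) -> Mapping[str, Any]:
        """Return the latest database value as a read-only snapshot."""
        return self._values[-1]

    def transact(self, key: str, value: Any) -> None:
        """Derive a new value from the latest one instead of changing it in place."""
        self._values.append(MappingProxyType({**self._values[-1], key: value}))


conn = Connection()
conn.transact("account-1/balance", 100)
snapshot = conn.db()                      # "stops time" at this point
conn.transact("account-1/balance", 250)   # later writes do not affect `snapshot`
latest = conn.db()
```

Code holding `snapshot` keeps seeing a balance of 100 however many transactions happen afterwards, which is the checkout analogy from the answer above; asking the connection again yields the newer value.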
We’ve already said you started on the JVM, but you started with microservices as well. Why? [19:39]
Charles Humble: Another decision that you took early on, which actually caused a lot of discussion after your QCON talk, was that you started with microservices. Typically, if you’re starting up a startup, you would pick something that’s quick to stand up. The classic would be a Ruby on Rails monolith, or something like that. You’d build that, and then if it worked and you started attracting customers, you would perhaps move to Java or a JVM-based language and refactor to microservices. We’ve already said you started on the JVM, but you started with microservices as well. Why?
Lucas Cavalcanti: We started with the idea that we would either succeed or fail miserably. There was no middle ground. Being a small or midsize company wouldn’t make much sense. So one of the ideas was that at scale, if the company grew to multiple teams or multiple places, a microservices architecture is one that allows you to have multiple teams, and it solves a little bit of the organization between the teams. But that was not the only thing. We also had some restrictions on the data that we were saving. For example, we had customer data, which is personally identifiable information, and we assumed that we needed to treat it differently from the customer’s transactional data. And we had the authorization system, which would also have different storage requirements, or database requirements. So the initial split into microservices was mostly because we had those heterogeneous storage requirements for the different kinds of data that we had.
So we had a customer service that was designed to store personal information, and which would therefore have encryption at rest, or enough encryption that we don’t leak customer data. We had the transactional data, which was going to have different scaling requirements. And we had the authorization data, which also had different requirements. Those were the first three services that we had. And we had a fourth one, which was a geolocation service, because for the sign-up process we need things like zip codes, and the zip codes had a database of their own anyway. So the initial decision to separate into services, not necessarily microservices, but having our services split in the architecture, was because of these different requirements for storing data. We could have done that in the same database, but it’s harder when you have different scalability or different encryption requirements.
As you went through rapid growth what were some of the challenges that that gave you from an architectural point of view? [22:33]
Charles Humble: And then after you launched, you went through a very rapid period of growth, from about 2015 through 2016, which I think was a lot faster than you anticipated. So what were some of the challenges that that gave you from an architectural point of view?
Lucas Cavalcanti: One of the major things that happened at the time is that we were leveraging a third party to run the credit card system. We didn’t build a credit card system from scratch. We basically bought an off-the-shelf credit card processor system. And from the projections that we had, and from the initial conversations, we could live with that processor for a few years before having to bring everything in-house. So the first surprise was that we had to build a full credit card system to move off that processor much earlier than anticipated. The first architecture question was: how do we migrate off a third-party system and bring all of its features in-house? I think that was one of the first challenges. It works like breaking up a legacy system, and all the behaviors that we see in that kind of work are very challenging architecturally, and on the business side as well, bringing every feature in-house.
Lucas Cavalcanti: Our initial architecture for deployment also stopped scaling earlier than we anticipated, right? We had an architecture that could fit 500,000 customers easily, and we thought that we could live with that for a few years as well. We hit that target only a year or two later, sooner than we anticipated. And we had to start a sharding project because of the initial choice that we had made to use CloudFormation on AWS. We didn’t have Kubernetes at the time. We didn’t have Docker at the time that we started the company. And basically, the company grew faster than we could run the project of migrating everything to Docker, or to better structures on AWS. So we had to do a sharding project.
This sharding project was basically this: we keep all of a customer’s data in the same shard, and we can have another shard with a different set of customers and all of their data. That was the decision we made on sharding: we separated by customer. Anyway, this was one of the things about leveraging the cloud: some parts you can scale vertically, some parts you can scale horizontally, but eventually you reach a limit. We had to do this sharding project because we do have that limit on the cloud. It’s not infinite. And we had to do it much earlier than expected. By the time we really finished the sharding project, the first shard was bigger than it should have been, and that created several scalability challenges later.
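Sharding by customer, as described above, might look like the following Python sketch. How Nubank actually assigns customers to shards is not stated in the interview; the fill-in-order policy, the capacity number, and the names `ShardDirectory`, `shard_for`, and `database_name` are all assumptions for illustration.

```python
from typing import Dict, List


class ShardDirectory:
    """Assigns each customer to exactly one shard and remembers the choice."""

    def __init__(self, shards: List[str], capacity_per_shard: int) -> None:
        self._shards = shards
        self._capacity = capacity_per_shard
        self._assignment: Dict[str, str] = {}

    def shard_for(self, customer_id: str) -> str:
        """Return the customer's shard, assigning one on first sight."""
        if customer_id not in self._assignment:
            self._assignment[customer_id] = self._pick_shard()
        return self._assignment[customer_id]

    def _pick_shard(self) -> str:
        # Assumed policy: new customers go to the first shard that has room.
        for shard in self._shards:
            if self._count(shard) < self._capacity:
                return shard
        raise RuntimeError("all shards are full; add a new shard")

    def _count(self, shard: str) -> int:
        return sum(1 for s in self._assignment.values() if s == shard)


def database_name(service: str, shard: str) -> str:
    """Every service keeps a separate database inside each shard."""
    return f"{service}-{shard}"


directory = ShardDirectory(["shard-0", "shard-1"], capacity_per_shard=2)
placements = [directory.shard_for(c) for c in ["ana", "bia", "caio"]]
caio_accounts_db = database_name("accounts", directory.shard_for("caio"))
```

Because the assignment is sticky, every request for a given customer lands on the same shard, so all of that customer's data across services stays together.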
Did the decision to shard data by customer throw up some more unexpected edge cases? [25:49]
Charles Humble: You said you were sharding by customer. You’ve said already that your initial shard was unexpectedly large. I’m imagining there must be a lot of edge cases if you’re sharding data that way, for instance different amounts of activity by account or whatever. So did that choice throw up some more unexpected edge cases, things that you were not anticipating when you made the decision?
Lucas Cavalcanti: Yeah, I think eventually we learned that not just the number of customers is a factor, but also the tenure of each customer, and the fact that the very first customers of a startup are the early adopters, the ones that use the product the most. So this first shard ended up being a special shard for a while, because it had the most passionate customers, who use the product a lot. It had more data, more novelty. And it had more customers, because it took too long to implement the sharding project.
For example, all the provisioning on that shard was four or five times more than on the other shards that we created afterwards. Eventually, we could make all the shards about the same size. But we did have to do performance fine-tuning on that first shard. And the whole idea behind doing sharding in the first place was to not do any performance tuning, because optimized code tends to be a lot more complex than non-optimized code. Optimizing for humans is very different from optimizing for the machine. So we ended up having to do optimizations, because the first shard was special for a long time.
You said as well that you switched to using Kubernetes for container orchestration. Are you using EKS, or did you do it in house? [27:39]
Charles Humble: Right, yeah. Actually, now that you say it, that’s not surprising, and I can imagine why you would miss it. You said as well that you switched to using Kubernetes for container orchestration. So how did you go about that? Are you using EKS, or a service, or did you do it in house?
Lucas Cavalcanti: When we started using Kubernetes, there was no EKS, at least in Brazil. That’s what you get for being in the early adopter phase of technologies. We did that for Kafka. We did that for Docker. We did that for Kubernetes. When we started, there was no cloud solution for those yet. AWS didn’t have a managed Kafka cluster back then. AWS didn’t have a Docker container service or a Kubernetes service. And now it does.
So if we were starting today, we would probably do everything on the cloud, or use a cloud service for that. But we started by running our own Docker registry, and eventually migrated to the AWS one. We ran our own Kubernetes cluster, and we are still in the process of migrating to the Amazon service. We saw the potential of those technologies, so we felt that it was worth it to be an early adopter, or an early enough adopter, of those. But we paid the price of running our own solution before there was a cloud one.
What does the data pipeline look like? [29:08]
Charles Humble: And presumably as you’re scaling up, your data needs are changing as well. You said you use Kafka already. So initially I was thinking, well, presumably it is Kafka Streams now. But I’m wondering, were you too early for Kafka Streams as well?
Lucas Cavalcanti: By the time Kafka Streams was production-ready, we already had our own style of using Kafka. And if I’m not mistaken, we had already built an ETL pipeline to move the data into a Spark cluster. So basically, since we already had to do that, it didn’t make sense to use Kafka Streams anymore, or at least not in general. I think it might make sense for specific cases, but not for having this aggregated data. Because of the sharding and the microservice architecture, our data is split in two ways. The first split is by service: every service might have its own database. The second is by shard: every service has multiple copies of its database. So we needed an aggregator system that could get all the data from all the services in all the shards, so that analysts could run queries and do analysis, and all the other functions in the company could access that data, which was split. So we did need to run that ETL process. And by the time we needed to do that, Kafka Streams was not ready, so we didn’t use it.
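The aggregation problem described above, data split by service and again by shard, can be sketched in Python as follows. The `DATABASES` dictionary is a made-up stand-in for the real per-service, per-shard databases, and `extract` and `load` are hypothetical names; the actual pipeline runs on Spark and is not shown in the interview.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]

# Hypothetical stand-in: one small table per (service, shard) pair.
DATABASES: Dict[Tuple[str, str], List[Row]] = {
    ("accounts", "shard-0"): [{"customer": "c1", "balance": 100}],
    ("accounts", "shard-1"): [{"customer": "c2", "balance": 250}],
    ("purchases", "shard-0"): [{"customer": "c1", "amount": 40}],
    ("purchases", "shard-1"): [{"customer": "c2", "amount": 15}],
}


def extract(databases: Dict[Tuple[str, str], List[Row]]) -> Iterator[Row]:
    """Yield every row, tagged with the service and shard it came from."""
    for (service, shard), rows in sorted(databases.items()):
        for row in rows:
            yield {"service": service, "shard": shard, **row}


def load(rows: Iterable[Row]) -> Dict[str, List[Row]]:
    """Group rows into one table per service, spanning all shards."""
    tables: Dict[str, List[Row]] = {}
    for row in rows:
        tables.setdefault(row["service"], []).append(row)
    return tables


warehouse = load(extract(DATABASES))
total_balance = sum(row["balance"] for row in warehouse["accounts"])
```

With the rows unioned across shards, an analyst can ask a question such as the total balance over all customers without knowing which shard each customer lives on.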
What are you doing in terms of the mobile and front end? What does that technology stack look like now? [30:41]
Charles Humble: We’ve mostly been focusing on the back end. But just out of curiosity, where you are now, what are you doing in terms of the mobile and front end? What does that technology stack look like now?
Lucas Cavalcanti: That also evolved a lot over time in the company. Today, the new code is on Flutter. We’re using Flutter plus Dart to write new code that can run on both operating systems, iOS and Android. But over time we had several technology shifts on mobile. We started with almost fully native code. By the time React got some traction, we started doing a little bit with React Native. We made some changes over time to eventually get to Flutter. But we haven’t migrated the whole app to Flutter yet, because it takes a while to do that. We still have some code in the previous technologies that we used. But today we are leveraging Flutter.
Lucas Cavalcanti: I’m not a mobile developer myself. I’m a little bit far from that. But it seems to boost productivity a lot. We pair that with a design system as well: the Flutter technology plus reusable components that already have the company’s look and feel, and that are easy to plug in. This boosts productivity a lot, and we don’t need mobile specialists for everything. We can have generalists doing mobile work. We do have a mobile specialist team that builds the reusable components for that design system in Flutter, and then generalists can use them in a simpler way, in a way that doesn’t require too much mobile context.
With the benefit of hindsight, are there any one or two things that, knowing what you know now, you would do differently? [32:38]
Charles Humble: It’s not often that you get to do a startup from the very beginning through hyper-growth, to the point that you’ve reached now. With the benefit of hindsight, are there any one or two things that, knowing what you know now, you would do differently?
Lucas Cavalcanti: One thing that we underestimated about this system is that financial services are a regulated industry. We made several choices that worked for the day-to-day operations of the system, so the production environment, but some of those choices didn’t work that well in the analytical environment, right? We wrote some clever code that did some automatic fixing of invariants for the system. That is very hard to replicate in the analytical environment later, to see why the system self-corrected that way. We did have some implicit operations going on, and at the time it felt like a good idea. For example, you have an accounting system with an account that should never go negative. When it went negative, we had code that self-corrected it and made sure that it was never negative. But this is something that only exists in the memory of the machine that runs the code. We don’t have traces for that particular correction.
And that ended up harming a lot our ability to explain things that happened. So, to say what I regret the most, it is having implicit operations that are not stored in the database, or that don’t have origination information. That was one of the major early mistakes that we made. The other was not documenting decisions, like the architecture decisions and design decisions that we made, and the product decisions. In the first years of the company, we barely documented anything. Not just the code, although we didn’t have documentation of the code either, but the most important things, which were the decisions. The decisions existed only in my head and in the heads of the few folks who were there at the beginning of the company. Not having that documentation hurt us a few years later, when we scaled to several teams and the company grew big enough that people with no context didn’t know why a decision was made. It doesn’t scale for me to explain everything to everyone every time, right? So I think those were the two things.
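The difference between an implicit in-memory fix and a stored correction with origination information can be sketched like this. It is a hypothetical Python example built around the negative-balance case mentioned above; the `Entry` shape, the origin strings, and the function names are assumptions, not Nubank's ledger model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    amount: int   # signed amount in cents
    origin: str   # why this entry exists, e.g. "purchase" or a correction reason


def balance(entries: List[Entry]) -> int:
    return sum(entry.amount for entry in entries)


def implicit_balance(entries: List[Entry]) -> int:
    """The pattern to avoid: the fix happens in memory and leaves no trace."""
    return max(0, balance(entries))


def with_explicit_correction(entries: List[Entry]) -> List[Entry]:
    """Return a new ledger in which a negative balance is fixed by a stored entry."""
    shortfall = -balance(entries)
    if shortfall <= 0:
        return entries
    correction = Entry(amount=shortfall, origin="auto-correction:negative-balance")
    return entries + [correction]


ledger = [Entry(100, "deposit"), Entry(-130, "purchase")]
corrected = with_explicit_correction(ledger)
```

Both approaches report a balance of zero, but only the second leaves an entry that an analyst can later query to see that a correction of 30 cents was applied and why.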
Charles Humble: I think having an architecture decision log is one of those things that often gets missed, and it’s really valuable. Lucas, that’s great. Thank you very much for joining me this week on the InfoQ Podcast.
Lucas Cavalcanti: Thank you very much.