How Trustpilot takes a 'serverless first' approach to engineering with AWS

Trustpilot has embarked on an ambitious programme to go completely serverless with Amazon Web Services, aiming to fully embrace the architecture by the middle of next year, a shift the organisation estimates could deliver a 10x saving on cloud compute costs.

Scott Carey Nov 29th 2018

The Danish web company, which collates independent reviews of online businesses, started its serverless journey in 2016, when VP of engineering Martin Buberl came back from that year's AWS re:Invent.

Speaking at re:Invent in Las Vegas this week, Buberl said that if you had asked him two years ago, he "couldn't have imagined standing up here".

His engineering team successfully shifted to a nearly completely serverless architecture, leaning heavily on Lambda functions to reach a point where AWS is essentially fully responsible for code execution.

"Serverless was not completely new to me but the concept of serverless compute and Lambda functions really clicked for me [in 2016]," he said.

The company had already been cloud native for five years, running a high level architecture of event driven microservices and REST APIs. Now, with the addition of serverless functions-as-a-service and event queues, he felt ready to take the engineering team to what he saw as the next level.

How did it get there?

His first move was to add 'serverless first' to what Trustpilot calls its 'engineering principles'.

That reads: "If serverless is not available or practical, containers are recommended. Virtual servers are considered legacy and should be avoided."

Buberl admits that the day he got back from Las Vegas with grand plans to go completely serverless there were varying degrees of excitement from his engineers, and said that he may have overlooked the all important 'why' of Simon Sinek's Golden Circle.

It was the last part of that principle, the line on virtual servers, which caused most of the pushback. It came from the company's .NET developers, who remained reliant on virtual servers.

After engaging with the firm's engineers, Buberl said: "What happened is the engineers were happier but there were still a few folks raising their eyebrows and not fully bought in." After heading back to the drawing board, the organisation opted to move to .NET Core and Docker for that team.

As a result the expanded principle reads: "We do this because we strongly believe that serverless (FaaS, BaaS, DBaaS) is the future of the cloud and we'd like to be on the forefront of that movement. Serverless might not necessarily be the right choice for everything today, but start your architecture discussions there. We're in the process of fading out virtual servers and want to avoid creating new ones."

Once they were happy with this principle they open sourced it on GitHub, where it joined others such as: code review everything, services first, build smaller things, encapsulate in contexts and expose APIs, and aim to open source.

How does this architecture look?

This new architecture relies on an API management layer and the Simple Notification Service (SNS) pub/sub messaging service, which is tooled using GitHub and Slack.

"Github and Slack means you can immediately start using [Lambdas]," he said. So anytime anything happens a Github webhook, posts are sent out using the API gateway, where Lambda subscribes and fans out triggered actions using that SNS pub/sub mechanism, broadly speaking.

One example of how this is leveraged is for GDPR compliance. Data scientists were sometimes accidentally committing personally identifiable data within their training sets to GitHub, which would cause problems at audit. The answer is to bubble this up to Slack every time a potentially problematic commit is made to get that taken down as quickly as possible.
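
A check of that kind could look something like the following Python sketch. The article does not say how Trustpilot detects problematic commits, so the email-shaped regular expression, the function names and the message wording are all assumptions made for illustration. The returned dictionary uses the simple `{"text": ...}` shape that Slack incoming webhooks accept; actually posting it is left out.

```python
import re
from typing import Dict, List, Optional

# Deliberately simple pattern for email-shaped strings (an assumed heuristic).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_suspect_lines(diff_text: str) -> List[str]:
    """Return lines added by a unified diff that contain an email-like string."""
    suspects = []
    for line in diff_text.splitlines():
        # Added lines start with '+'; '+++' is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if EMAIL_RE.search(line):
                suspects.append(line[1:].strip())
    return suspects


def build_slack_alert(repo: str, commit_sha: str, diff_text: str) -> Optional[Dict[str, str]]:
    """Build a Slack message payload, or return None if nothing was flagged."""
    suspects = find_suspect_lines(diff_text)
    if not suspects:
        return None
    # Report a count only, so the alert does not copy personal data into Slack.
    return {
        "text": (
            f"Possible personal data in {repo}@{commit_sha[:7]}: "
            f"{len(suspects)} line(s) flagged. Please review and remove."
        )
    }
```

Reporting only the number of flagged lines is a deliberate choice in this sketch: echoing the matched text would spread the same data into a second system.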

The company now runs 47 percent fewer virtual servers, down from 180 to 95 today; 283 containers, roughly three and a half times the 80 it ran in 2016; and 252 regular Lambda functions, up from 40.

Benefits

Buberl said the question he gets asked the most is whether the Lambda functions are cheaper.

The problem, he believes, is that comparing Lambda with traditional cloud compute is like comparing apples with bananas.

"Effort has to go in to autoscaling systems," he said. "And we see it's hard to quantify. Then if you make mistakes and the system doesn't scale that is expensive too."

However, his "gut feel" is that the serverless architecture is now "10 times cheaper", thanks in large part to the reduction in operations overhead.

The other benefit of going serverless, he said, is faster development speed, but the biggest downside has been a loss of traceability across systems.

"We're investing in this as you have lots of smaller systems," he said, with Trustpilot now running more than 500. Today his team is using Amazon X Ray and logging to track these services, but is looking to invest in a service mesh "to bring all these systems together and map them there".

The next step is to shut down its remaining EC2 instances, which are earmarked for end of life in Q2 2019, with "only a few snowflake systems allowed".