Evolving an enterprise-ready startup stack
As an engineer at SAP.iO, I ensure that the software of our ventures is enterprise-ready. For some, that may mean user experience takes a back seat; that's not us, we're very design-driven. Others may think our apps should be available for on-premise deployment; that's also not us, we're totally cloud. What enterprise-ready means to us is being ready to scale overnight when a corporate user decides to move from five to 500 seats.
Consequently, we have to be very deliberate about selecting the most appropriate elements of our software stack. As we invest and onboard new ventures, we continue refining our approaches, since different ventures require different choices. Based on what I learned developing the technology stack for our geospatial analytics venture Atlas by SAP, here's my take on how we've evolved from a static webpage to a refined tool that our customers love.
(The Atlas Team)
Front End

Nothing beats a front-end framework for convenience, and we chose Vue. It's not the most immediately obvious choice, but it has turned out to be a good one. First, nobody on the team really likes Angular, so that was easy. Second, React was unavailable to us when we started building the application because of the notorious Facebook patent license. Third, Alibaba was already using Vue at the time, which was a reasonable guarantee of its longevity; in fact, Vue adoption has grown rapidly since then. Although the React license has since changed, we're too invested in Vue to go back (not that we'd particularly want to, since Vue has been great).
Of course, by using Vue we lose the ability to easily migrate to a native mobile application that React Native provides, but the sales team has already determined that “geospatial analytics on-the-go” is not much of a selling point to the average business user, or indeed, anyone. The fact that the app works on mobile is mostly a happy coincidence.
It's also probably a coincidence if the app works on any browser other than the last two Chrome and Safari releases. Business users are not always on cutting-edge hardware and software, but sadly it'd be too much of a strain on the team to support Firefox, Edge, and IE9 (only kidding, I'm not sad about the last one at all). This takes away much of the pain of troubleshooting CSS rendering differences, and also lets us take advantage of newer features natively, specifically ES2017's now-standard async/await, which is a huge timesaver compared with bare-bones promise chains, and especially compared with callback hell.
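To sketch why that matters: the same pair of dependent calls reads very differently as a promise chain versus async/await. (The `fetchUser`/`fetchOrders` functions below are hypothetical stand-ins, not our real API.)

```javascript
// Hypothetical async data-access stand-ins (not our real API).
const fetchUser = (id) => Promise.resolve({ id, name: 'Ada' });
const fetchOrders = (user) => Promise.resolve([{ user: user.id, total: 42 }]);

// Promise chains: each step is another .then(), and error handling
// has to be threaded through the chain.
function totalSpendChained(userId) {
  return fetchUser(userId)
    .then((user) => fetchOrders(user))
    .then((orders) => orders.reduce((sum, o) => sum + o.total, 0));
}

// async/await: reads top-to-bottom, and one try/catch can cover
// the whole sequence.
async function totalSpend(userId) {
  const user = await fetchUser(userId);
  const orders = await fetchOrders(user);
  return orders.reduce((sum, o) => sum + o.total, 0);
}
```

Both functions return the same promise; the second is just far easier to read and debug once the sequence grows past two steps.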
(LA divided into half a million hexagons, scored for a particular customer profile in realtime.)
In truth, part of the browser requirement also comes from the fact that we're using Mapbox GL to do the map rendering, which as the name suggests requires WebGL. We've looked into running our own tile server using data from OpenStreetMap, but that would definitely be the kind of interesting academic exercise that kills startups by pulling the team further and further away from building core product.
On a side note: in general, we try to minimize the amount of building, and use open source libraries and third-party SaaS wherever possible (also read Why Open Source Isn’t an Open-and-Shut Case for Startups). So far, we haven’t encountered any case where the amount vendors charge for support or additional services (such as Mapbox Studio) exceeds the time cost of building it ourselves.
Build and Deployment
Atlas' build process is fairly standard: a push to GitHub triggers our Travis CI pipeline, which runs tests and, if they pass, builds with Webpack. Code is then pushed to our frontend server, a low-spec EC2 machine, using AWS CodeDeploy. Nginx fronts that server, and like any sensible startup, we get free SSL certificates from Let's Encrypt.
Also like any other honest startup, we move too fast to achieve great test coverage. We have built a useful addition, though: an hourly Node script, run via cron, that uses Puppeteer to drive a headless Chrome browser through some common user actions, checking that DOM elements are where they're supposed to be and contain what they're supposed to contain. This doesn't prevent disaster, but it does mean we detect and mitigate it quickly.
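The shape of that script is roughly the following, a minimal sketch with hypothetical selectors and a pure comparison helper split out (the real checks and URL are obviously different):

```javascript
// Hypothetical expectations for the smoke test; the real list covers
// our common user flows.
const CHECKS = [
  { selector: '#map-canvas', mustExist: true },
  { selector: '.result-count', mustContain: 'results' },
];

// Pure helper: compare what the page reported against expectations and
// return human-readable failures (empty array means all good).
function findFailures(checks, pageState) {
  const failures = [];
  for (const check of checks) {
    const found = pageState[check.selector];
    if (check.mustExist && found === undefined) {
      failures.push(`missing element: ${check.selector}`);
    } else if (check.mustContain && !(found || '').includes(check.mustContain)) {
      failures.push(`bad content in ${check.selector}`);
    }
  }
  return failures;
}

// The browser side (requires the `puppeteer` package; invoked hourly via cron).
async function runSmokeTest(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Collect the text content of every element we care about in one pass.
  const pageState = await page.evaluate((checks) => {
    const state = {};
    for (const { selector } of checks) {
      const el = document.querySelector(selector);
      if (el) state[selector] = el.textContent;
    }
    return state;
  }, CHECKS);
  await browser.close();
  return findFailures(CHECKS, pageState);
}
```

When `runSmokeTest` returns a non-empty array, the cron wrapper alerts us, which is how we get the fast detection mentioned above.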
Nginx also proxies AJAX requests through to our backend, on the subject of which…
Back End

Like many other engineering teams, we'd love to improve our Go or finally get around to learning Elixir. Short of that, it would still be fun to build the backend in Python with Flask. Instead, we chose the least fun, boring option: Node (although at least we don't have to deal with the boilerplate of Java/Spring).
Node has proved a solid choice, for several reasons:
- Productivity. In the early days, there's no clear separation of back end and front end, so every contributor has to be full stack. Less context switching between languages means higher productivity.
- Mature ecosystem. There’s a dependable npm package for almost everything. Importantly, the combination of PassportJS and express-session for authentication is industry standard and has passed our security audit.
- Scalable. Because of its single-threaded nature, we don't need to hack parallelism into the code itself, but can — provided we keep the app stateless, one reason we've avoided web sockets — simply run many instances of the app behind a load balancer. If we ever need crazy scale, breaking off bits of the app to go serverless should be fairly painless.
- Performance. Realistically, most backend endpoints are basic GETs, POSTs, and if you're feeling particularly RESTful also PUTs and DELETEs. I mean this in the pure sense of those verbs: CRUD endpoints that are I/O and network bound. Node easily holds its own in these circumstances: load testing our EC2 m4.xlarge, a $140/month machine, shows it comfortably handling 200 concurrent requests, which might represent thousands of concurrent users.
There are a few endpoints in our app that are heavily compute-intensive, and here the single-threaded, interpreted nature of Node might become a problem. To handle those situations we’ve built specific microservices, so far in Python using NumPy.
Aside from those microservices, which are deployed on additional servers in Docker containers (manually, until we find the time to do DevOps properly), the app is monolithic, which has allowed us to iterate quickly without dependency issues or duplicated code and infrastructure.
Databases

Initially, it was easy to get up and running quickly by spinning up our own MongoDB instance to handle user objects and session tokens. However, after a few months of headaches from frequent security patches, we gratefully took the aspirin of AWS's DynamoDB. It wasn't a drop-in replacement — Dynamo is a key-value store, so you have to JSON.stringify objects and/or flatten them out — but the migration wasn't that big a deal.
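Concretely, the stringify-and-flatten step looks something like the helpers below (the attribute names and record shape are hypothetical; the actual read/write goes through the AWS SDK, e.g. `DocumentClient.put`/`get`, which isn't shown here):

```javascript
// Dynamo wants a flat attribute map per item, so nested objects are
// JSON.stringify'd on the way in and parsed on the way out.
// Hypothetical user record shape for illustration.
function toItem(user) {
  return {
    userId: user.id, // partition key
    email: user.email,
    profile: JSON.stringify(user.profile), // nested object, flattened
  };
}

function fromItem(item) {
  return {
    id: item.userId,
    email: item.email,
    profile: JSON.parse(item.profile),
  };
}
```

As long as every read goes through `fromItem` and every write through `toItem`, the rest of the app never notices that the store underneath stopped being Mongo.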
Essentially, Atlas allows you to crunch geotagged numbers and display them on a map, so a SQL database was an obvious choice for us to house those numbers. To give you some idea, our data tables vary widely in size:
- a mere 33,000 rows, representing approximately the number of zip codes in the U.S.;
- 80 million rows, the number of points of interest in the U.S.;
- 350 million rows: for some of the U.S.'s biggest cities, divided up into tiny hexagonal grids, the breakdown of foot traffic for each hour of each day by age, gender, and income bracket.
In terms of popularity, flexibility, performance, maturity, and freedom (as in speech and beer), we're convinced that nothing else comes close to PostgreSQL at this point. At first, we had no choice but to self-host: back then, PostGIS (the geographic extension to Postgres) was not available in any managed offering. As soon as the Postgres-compatible edition of AWS Aurora covered it, we were out of the self-hosted game faster than a tough day on HQ Trivia. The switch was straightforward, barring one obscure incompatibility with stock Postgres, and went smoothly once we learned to avoid AWS's Database Migration Service in favor of plain pg_dump.
Primarily, this relieves us from having to administer the self-hosted databases, which is a big time saver and stress reliever. But we also get fantastic durability and availability guarantees that we can pass on to our customers, as well as easy horizontal scalability through read replicas. Aurora is performing on par with our self-hosted instances, which ran on m4.2xlarge EC2 machines. Occasionally we see some latency in our staging environment on a Monday morning, when nobody has used the database over the weekend and swapped-out data is brought back into memory. Overall, we're extremely happy with this evolution, which has also cut our infrastructure costs by 75%.
With an eye towards the future as Atlas continues to grow in sophistication, we know we’ll have to take in schema-less data from our users, and maybe look again at NoSQL databases, or even go down the batch processing route with a data lake in HDFS queried via Hive and Spark SQL. But you can bet we’ll have hit the limits of using materialized views on Postgres’ JSON datatype before we do.
As you can see, our enterprise-ready stack is basically what you'd arrive at a few days after a fairly intensive hackathon session, once you've had a bit of time to think and build things out properly.
So far, the stack has struck a good balance between the agility and freedom to build quickly and the demanding requirements of processing, analyzing, and visualizing big geographic data in real time. The intersection of business software, performance, and UX is a surprisingly interesting place to be.