Diagnosing poor application performance is an art and a science. Below you will find my personal short list of common causes of poor app/site/API performance, compiled over decades of late-night debugging sessions and too much coffee.
Wells Burke
In the summer of ’97, I got my first real exposure to what it takes to build fast, high-scale web apps as a junior developer on IBM’s cyber cast team for the US Open Tennis Tournament. I spent weeks painstakingly developing beautifully manicured cross-browser-compatible HTML templates for just about every page of the site, including a bunch of hellacious nested table layouts for the marquee real-time box-scoring feature. Boy, was I proud of my code! When we landed in NYC for the three weeks of the tournament, I still had no idea how much I was about to learn.
One of the first on-site team members I met was a guy named Nick who had an IBM Power PC laptop running AIX. That was some pretty rarified nerd gear back then and I immediately wanted one. Nick was part of the IBM team building the web hosting infrastructure for the 1998 Nagano Winter Olympics; he was there to use the US Open as a testing platform to see just how far they could scale traffic before the infrastructure and thus the user experience started to break down.
Nick and team had a super-cool trick. Remember those HTML layouts I made for real-time box scores? Well, on low-traffic tournament mornings, these guys would incrementally dial up the HTML refresh rate on the scores pages and use the web browsers of hundreds of thousands of unsuspecting tennis fans around the world to generate a nearly unlimited volume of concurrent requests. They were setting new internet traffic records every day; it was incredible. “The Slashdot Effect” was already popular nerd parlance, but I don’t think the term “denial of service attack” had even been coined yet.
So, what does this have to do with my prized HTML? Shortly after setting up our on-site office/war room, Nick called me over to show me something on his badass laptop he was super-excited about. He’d written a script to strip out all of the whitespace from every HTML page I’d authored, reducing the page byte size considerably. I was horrified; my beautifully indented layouts had been mutilated. Smiling, he said, “It’s like I always say, save 30% of your homepage size and add 30% more users.” Fortunately, my 21-year-old wounded ego kept its mouth shut long enough for me to learn a valuable lesson… At least, that’s how I choose to remember it.
And that was the start of a career that’s often involved diving headfirst into performance and scalability problems. While NoSQL internet-scale databases and hyper-scale cloud platforms have banished many of yesterday’s problems, so much of the investigation game is the same. It seems that as our stacks’ capabilities have grown, they’ve introduced their own sets of nuances that often capture the most mind share when breaking down a problem. We need to remember to double-check the basics. Here’s my list of slow-app “usual suspects” (as relevant today as back in 1997), plus some new items courtesy of today’s fabulously updated tech.
Job #1 is to determine if your app server request processing time is actually slow or if one or more client-side inefficiencies are creating a perception of slowness. The answer will probably be both, but go for the biggest wins first.
Optimize for first meaningful paint
- Poorly structured CSS/JS can block rendering and force all page assets to load before the first meaningful paint in the browser, making it look like the entire page is slow.
- Tools built into Google Chrome, such as Lighthouse, give you x-ray vision into this type of problem.
Unnecessarily large payloads
- Use media queries, img srcset, and other responsive techniques to load appropriately sized images for the client device. For example, don’t load desktop images on mobile devices: even though your code may autoscale them so they look right, it is a huge waste of bandwidth and time. There are plenty of tools out there that will automatically resize your images and cache them in a CDN for you.
Too many assets / connections
- If your design uses a large number of small raster icons/images, consider using CSS sprites.
Poor use of HTTP caching headers
- Most modern CMS platforms (WordPress, etc.) are pretty good at setting generically appropriate Cache-Control and ETag headers on static assets so browsers don’t continually try to reload them. It is always a good idea to double-check these settings, especially if your custom logic ends up serving static content.
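One way to double-check those settings is to run the response headers of a few static assets through a small audit. The sketch below is a minimal example, not a complete policy: the function name `audit_cache_headers` and the specific warning rules are my own assumptions, and it only inspects a plain dictionary of headers that you would capture with whatever HTTP client you already use.

```python
def audit_cache_headers(headers: dict[str, str]) -> list[str]:
    """Return warnings about caching for one static-asset response."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    warnings: list[str] = []

    cache_control = normalized.get("cache-control", "")
    # Reduce "public, max-age=3600" to the directive names {"public", "max-age"}.
    directives = {
        part.strip().split("=", 1)[0].lower()
        for part in cache_control.split(",")
        if part.strip()
    }

    if not directives:
        warnings.append("no Cache-Control header")
    elif directives & {"no-store", "no-cache"}:
        warnings.append("Cache-Control forbids storing or forces revalidation")
    elif not directives & {"max-age", "s-maxage"}:
        warnings.append("Cache-Control sets no explicit lifetime")

    # Without a validator the browser cannot make a cheap conditional request.
    if "etag" not in normalized and "last-modified" not in normalized:
        warnings.append("no validator (ETag or Last-Modified)")

    return warnings


# Hypothetical header sets: one well-configured asset, one with no caching hints.
print(audit_cache_headers({"Cache-Control": "public, max-age=31536000", "ETag": '"abc123"'}))
print(audit_cache_headers({"Content-Type": "image/png"}))
```

An empty list means none of these particular checks fired; it does not prove that the lifetime you chose is right for your content.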
Too many database queries
- If you are running hundreds of queries to generate a page (an easy thing to do with something like WordPress and WooCommerce plus additional plugins), there is almost no way to make the site perform and scale.
- Even for highly dynamic sites, the majority of the dynamic content does not actually change from request to request.
- If possible, extract any personalized content into AJAX calls so the dynamic content on the core page can be cached on the server side for future requests.
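The idea of caching the non-personalized page on the server can be sketched as a small time-to-live cache. This is an illustration under assumptions: `TTLCache`, `get_or_render`, and the 60-second lifetime are hypothetical names and values, and a production setup would more likely use a shared cache or a reverse proxy than an in-process dictionary.

```python
import time
from typing import Callable


class TTLCache:
    """Keep rendered pages in memory for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    def get_or_render(self, key: str, render: Callable[[], str]) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive render
        html = render()  # this is where the database queries would run
        self._entries[key] = (now + self._ttl, html)
        return html


render_count = []


def render_catalog_page() -> str:
    # Stand-in for a page that would otherwise run many queries.
    render_count.append(1)
    return "<h1>Catalog</h1>"


cache = TTLCache(ttl_seconds=60)
for _ in range(3):
    cache.get_or_render("/catalog", render_catalog_page)
print(len(render_count))  # the page was rendered once for three requests
```

Anything user-specific (cart count, greeting) stays out of the cached HTML and is fetched separately, which is what makes one cached copy safe to serve to everyone.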
Lack of a CDN
- Cloudflare: enough said. OK, we like CloudFront too if you are AWS-native, but Cloudflare’s free CDN is basically impossible to beat.
- For all third-party JS libs, fonts, etc. that you’re not going to embed in your code, make sure you link to their CDN sources.
- SSL still takes a lot of compute resources. SSL offloading moves the SSL termination off of the servers that are processing your business logic and onto a load balancer or some other type of proxy. If your requests are slow but your app server CPU is not maxed, this is still a best practice, but it is probably not the culprit.
Undersized app server worker thread pool
- If you are working on a platform like Java/Spring Boot, when moving from development to production it can be really easy to overlook prod configurations such as the number of worker threads or the JVM memory allocation. If your worker thread pool is undersized compared to CPU resources, you could be artificially forcing requests to queue up while CPU cycles go underutilized.
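A quick sanity check for pool size is Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below is plain arithmetic rather than any platform's configuration API; the function names and the traffic numbers are hypothetical.

```python
import math


def min_worker_threads(requests_per_second: float, avg_seconds_per_request: float) -> int:
    """Little's law: average requests in flight = arrival rate x time in system."""
    return math.ceil(requests_per_second * avg_seconds_per_request)


def max_throughput(worker_threads: int, avg_seconds_per_request: float) -> float:
    """Upper bound on requests per second when each request holds one worker."""
    return worker_threads / avg_seconds_per_request


# Hypothetical load: 200 req/s, each request holds a worker for 250 ms,
# mostly waiting on the database rather than burning CPU.
print(min_worker_threads(200, 0.25))  # 50
print(max_throughput(20, 0.25))       # 80.0
```

In this made-up case a pool of 20 workers tops out at 80 requests per second no matter how idle the CPU is, so the remaining requests queue. Treat the result as a floor and leave headroom for bursts.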
Cold starts for serverless functions
- Serverless deployments with AWS Lambda, Azure Functions, and the like can be awesome, but you have to pay attention to potential “cold start” times that are the straight-up overhead of the cloud infrastructure. If you are running in a serverless environment and are seeing inconsistencies in response times, particularly if your traffic is naturally bursty, you may want to check to see if your longer response times correlate with launching new serverless runtimes. If your traffic is consistently slow across the board, this could be a contributing factor, but is probably not the main culprit.
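To check that correlation, split your request latencies by whether the invocation was a cold start and compare the two groups. The sketch below assumes you have already extracted `(latency_ms, was_cold_start)` pairs from your platform's logs; that record shape, the function name, and the sample numbers are all hypothetical.

```python
from statistics import median
from typing import Optional


def latency_by_start_type(requests: list[tuple[float, bool]]) -> dict[str, Optional[float]]:
    """Median latency in ms for warm and cold invocations (None if a group is empty)."""
    warm = [ms for ms, is_cold in requests if not is_cold]
    cold = [ms for ms, is_cold in requests if is_cold]
    return {
        "warm_median_ms": median(warm) if warm else None,
        "cold_median_ms": median(cold) if cold else None,
        # Share of traffic that paid the cold-start penalty at all.
        "cold_share": len(cold) / len(requests) if requests else None,
    }


# Made-up sample: three warm requests and two cold ones.
sample = [(100, False), (110, False), (120, False), (900, True), (1100, True)]
print(latency_by_start_type(sample))
```

If the warm median is already slow, cold starts are at most a contributing factor, which matches the rule of thumb above.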
Running your app logic and database on the same server
- It feels like this should not even make the list anymore, given that separating these functions has been a best practice for over 20 years, but amazingly we still run across legacy apps with this configuration all the time.
Mismatch between database compute and IOPS
- For cloud databases in AWS, Azure, and other clouds, you can dial in exactly how much CPU and IOPS you want. In a cloud context, your IOPS allocation limits the number of disk reads/writes per second your db can perform. If your CPU is not maxed out but your queries are slow even though they are appropriately indexed, there’s a good chance your IOPS allocation is maxed. It’s the cloud equivalent of having super-fast CPUs and super-slow disk drives.
Undersized database connection pool
- This is another deployment configuration that is easy to overlook. If your individual queries are executing quickly according to the slow query log but the round-trip time seems to be taking a long time in your code, your logic thread may be blocked waiting for a connection. Double-check that you have a reasonable number of connections compared to the number of server worker threads.
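To tell "slow query" apart from "slow wait for a connection", time the checkout separately from the query. The sketch below uses a toy pool built on the standard library's `queue.Queue`; `TimedPool` and the string standing in for a connection are hypothetical, and a real pool library may already expose an equivalent metric.

```python
import queue
import threading
import time
from contextlib import contextmanager


class TimedPool:
    """Toy connection pool that records how long each checkout waited."""

    def __init__(self, connections):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self.wait_times = []  # seconds spent waiting, one entry per checkout

    @contextmanager
    def connection(self, timeout: float = 5.0):
        start = time.monotonic()
        conn = self._idle.get(timeout=timeout)  # blocks while the pool is exhausted
        self.wait_times.append(time.monotonic() - start)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


pool = TimedPool(["conn-1"])  # one placeholder connection for two callers


def handle_request():
    with pool.connection():
        pass  # the query itself would run here


with pool.connection():
    worker = threading.Thread(target=handle_request)
    worker.start()
    time.sleep(0.2)  # hold the only connection while the worker waits
worker.join()

print([round(w, 2) for w in pool.wait_times])
```

The second entry is dominated by waiting, not querying, which is exactly the time a slow query log never sees.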
Missing DB indexes
- When you have identified specific slow pages or API calls, if they are connecting to a relational database, make sure that your table indexes cover all of your query patterns.
- While NoSQL databases can really help eliminate RDBMS scalability problems, almost all of them have their own partitioning or indexing strategies that you must pay attention to. You can still end up doing the equivalent of full table scans in DBs such as Azure Cosmos DB and AWS DynamoDB. That will KILL performance… and probably your budget, too.
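Most relational databases can show you whether a query uses an index. As a small self-contained sketch, the example below uses SQLite (bundled with Python) as a stand-in for whatever database you actually run; the `orders` table, the index name, and the helper `query_plan` are hypothetical, and the exact wording of the plan varies by SQLite version.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's human-readable plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column holds the detail text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"

before = query_plan(conn, sql, (42,))  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = query_plan(conn, sql, (42,))   # now the plan should name the index

print("before:", before)
print("after: ", after)
```

The same habit applies elsewhere: run your database's own EXPLAIN on each slow query pattern and look for scans where you expected an index lookup.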
Hot NoSQL partitions
- Often a companion to many NoSQL indexing strategies is a partition strategy. Poorly designed partition keys can result in “hot partitions” that receive an oversized portion of the request traffic, causing those queries to slow down. Make sure your key strategy matches your data access patterns in a way that maximizes the spread of requests across partitions.
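You can estimate how evenly a candidate partition key spreads traffic before committing to it. The sketch below hashes keys into a fixed number of buckets and reports the share of requests landing on the busiest one. The hashing scheme is a simplified assumption, not how any particular database assigns partitions, and the tenant and order keys are made up.

```python
import hashlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def hottest_share(request_keys: list[str], partitions: int) -> float:
    """Fraction of all requests that land on the single busiest partition."""
    counts = Counter(partition_for(key, partitions) for key in request_keys)
    return max(counts.values()) / len(request_keys)


# Keyed by tenant: one large tenant generates 90% of the requests.
by_tenant = ["tenant-big"] * 900 + [f"tenant-{i}" for i in range(100)]
# Keyed by order: every request has its own key.
by_order = [f"order-{i}" for i in range(1000)]

print(round(hottest_share(by_tenant, 10), 2))
print(round(hottest_share(by_order, 10), 2))
```

With ten partitions a perfectly even spread would put 10% of requests on each. The tenant-keyed traffic sends at least 90% to a single partition, while the order-keyed traffic stays close to even.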
So, these are some of my “usual suspects,” the list of easily recognizable problems that may contribute to poor app/site/API performance. I would love to hear what’s on your list!
If you are having trouble with slow applications or struggling to scale up, contact us. We would love to see if we can help.