HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

MySQL in AWS RDS 100% CPU on some queries

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
queriesmysql100cpusomeawsrds

Problem

I have a managed MySQL (5.7.19) instance on AWS. In general, things work pretty well. I'm consistently around 4% CPU usage, and well below my IOPS limits for the burstable instance (t2.micro). However, if I make a query that does not use an index on a table that has likely fallen out of RAM and is on disk, MySQL will 'lock up' for a minute or so. Read IOPS do increase, but usually not enough to even dip into my credit pool. CPU will get stuck at 100% until the query completes. Other connections from the normally running service will queue up (I'll start to see 60+ connections), and many will eventually time out.

Here's an example query that locked up the database for almost a minute:

SELECT  *
    FROM  mydb.PurchaseDatabase
    WHERE  Time between '2018-11-20 00:00:00' AND '2018-11-23 00:00:00'
      and  ItemStatus=0
      and  ItemID="exampleitem";


Here's the RDS dashboard metrics when I made this query:

If I make the query a second time it completes almost instantly (it is now likely in RAM from the recent query against the same table). Similar queries are also fast again (0.173 seconds, for example). I enabled slow query logging as well as error logging and made the query a day later with the same 30 second delay (table had been paged or whatever out of RAM). However, nothing was written to the slow query table. I checked the error logs, and I can see messages like this when I make the slow queries:


2018-11-28T06:21:05.498947Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 37072ms. The settings might not be optimal. (flushed=4 and evicted=0, during the time.)

I think this might just be another symptom of the underlying problem, which is that my instance is having trouble reading/writing from disk. I am using SSD backed storage and my burst balance on the EBS volume isn't affected by these slow queries. I have lots of credits before and after the queries.

I then, foolishly, decided to try to help the database by clearing o

Solution

It was the swap! I ended up replicating the database on to the same hardware, and wrote some scripts to emulate live traffic on the database. I also ran some big queries to help fill up the buffer pool and then ensured that my replica database approximately matched the metrics of my production database. I then tried running large queries against it and it locked up, even with indexes applied. I could reproduce the issue without taking down the production service, so now I can break things as much as I want.

I noticed that running the same large queries earlier in the life of the replica database hadn't caused an issue, and tracked down the point where the locks ups begin. It happens almost immediately after the buffer pool gets large enough to push some data (OS, or otherwise) to swap on the t2.micro instance. Here's an image from Cloudwatch of the swap growing after the freeable memory drops below ~50MB or so:

Any large queries (with or without an index) start to lock up the database after the red circle. You can see the aggregate 5 minute CPU usage when I locked up the database for almost a minute performing a DELETE.

With this theory in mind, I tried two solutions:

1) I changed the value of innodb_buffer_pool_size to 375M (instead of its default AWS value of 3/4 the instance RAM size). This reduces the maximum size of the buffer pool and ensures that the database memory footprint won't grow large enough to push the OS/etc into swap. This worked!

2) I tried running the database on a larger instance (2GiB of RAM). This also worked!

Both of these solutions work, and the bonus with (1) is that I don't have to spend any extra money. I'm working on tuning the value of innodb_buffer_pool_size so that it is as large as possible without causing swap. I ran the same DELETE query in 1.2s and the database continued responding. The below screenshot wasn't even possible with the production database since the database would stop responding during these long queries, so the dashboard would never update and eventually lose connection.

Context

StackExchange Database Administrators Q#223690, answer score: 5

Revisions (0)

No revisions yet.