Our main goal with PingPing is simplicity. Simplicity for our customers, but also for ourselves, and that extends to our servers and database. We’re always busy improving PingPing by making changes in our code and optimizing the underlying servers. This time our database needed some love, and we decided to switch to a different, more powerful server.
We run our main infrastructure on Hetzner Cloud, a really great cloud provider with its own data centers in Finland and Germany. This gives us the advantage that our application servers, database, and Redis servers are all close together. Besides that, we also run a bunch of servers at the Vultr cloud provider; we use these to verify whether your website is actually down. The load on those servers is a lot lower. The load on our important servers, the database, and Redis is high, since we keep monitoring more and more websites.
We monitor all our servers with Prometheus and visualize the data with Grafana. Below you can find a screenshot of the old database server, where we’re looking especially at the IOPS (I/O operations per second). The results are not that bad, but there is still some room for improvement when it comes to the disk usage of the server.
On cloud servers (or shared virtual servers) you always share the underlying host system with other customers. So even if the host has really fast disks, you never get the full performance you would get from a server of your own.
In the graph, you can see we never get more than ~1.5k IOPS. At this point, this is enough for the amount of data we handle, except for one case. We run millions of checks every week, so we have a lot of records in our database. To keep it as fast as possible, we optimize the database on a daily basis, rebuilding tables and their indexes. This task took around 30 minutes to complete, a long time, because we need to go over the whole database. Spoiler! We got this down to 2 minutes.
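The daily optimization boils down to statements like the ones below. This is a sketch: the table name `checks` is made up for illustration, and the exact tables we run this on will differ.

```sql
-- Rebuild the table and its indexes, reclaiming space left by
-- deleted check records. On InnoDB this performs a full table rebuild,
-- which is why it takes long on a big database.
OPTIMIZE TABLE checks;

-- Refresh the index statistics the query planner relies on.
ANALYZE TABLE checks;
```

Because `OPTIMIZE TABLE` rewrites the whole table, its duration is dominated by disk throughput, which is exactly where the ~1.5k IOPS ceiling hurt us.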
The specs of our old server were already pretty good; for most applications it would perform perfectly. But to provide the best and fastest uptime checking, we also need a great server. Below you can find the specs of the old and the new server.
| | Old setup | New setup |
| --- | --- | --- |
| Server | Hetzner Cloud: CX51 | Hetzner Dedicated: AX41-NVME |
| Processor | 8 vCPU (Intel® Xeon® Gold) | AMD Ryzen 5 3600, 6 cores @ 3.6 GHz |
| RAM | 32 GB | 64 GB |
| Disk | 240 GB disk space | 2x 512 GB NVMe SSD |
| Speed | 1 GBit/s (shared) | 1 GBit/s (guaranteed) |
| Average system load | 7-9 | 1-2 |
Another benefit is that the new server is fully dedicated to us. No more sharing with other customers; 100% of the machine is for our own usage. A real beast!
Or: how we actually needed to do this three times. Yeah, you read that right. We needed three attempts to get it right. Ouch!
We first performed the migration twice manually in our testing environment. During these test migrations nothing odd happened and everything worked as expected. We noticed it took around 20 minutes, so we were ready to schedule a migration for our production environment.
We prepared a checklist and divided the tasks between all of us. We sent out an email to all of our customers so they knew there might be some downtime because of the migration. We were ready to go for our migration.
After chatting for a while in a Google Meet, we started the migration. Everything went smoothly. The migration took a bit longer than expected because we also ran an optimization of the tables. After around an hour, the migration was done. Everything looked fine for the first few minutes. The server was flying, and we had never seen such good performance from the database.
After around 30 minutes, strange things started to happen. In our monitoring, we noticed a lot more locked rows. We started investigating but couldn’t really find anything. One thing was for sure: this wasn’t right! Luckily we still had the old database available. We moved the data back to the old database server, and everything went back to normal.
PingPing was running again, so we now had time to investigate, and we identified the problem: we had tried to switch from MySQL 5.7 to MariaDB. They should be largely compatible, but MariaDB’s threading model is totally different. This broke our setup, because we have many short-lived threads and don’t want to cache them.
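The difference shows up in the servers’ thread-handling settings: MySQL maps each connection to its own thread (with an optional thread cache), while MariaDB can additionally run a pool-of-threads mode. A quick way to compare the two servers, assuming shell access to both (this is a diagnostic sketch, not our exact debugging session):

```sql
-- How the server maps connections to threads.
-- MySQL reports 'one-thread-per-connection'; MariaDB may report
-- 'pool-of-threads' when its thread pool is enabled.
SHOW GLOBAL VARIABLES LIKE 'thread_handling';

-- How many finished threads are kept around for reuse.
SHOW GLOBAL VARIABLES LIKE 'thread_cache_size';

-- How often a new connection had to spawn a fresh thread;
-- a fast-growing counter hints at many short-lived connections.
SHOW GLOBAL STATUS LIKE 'Threads_created';
```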
Back to the drawing board.
This time we used MySQL 8 instead of MariaDB. We were scared that this might cause some issues as well, since we previously used MySQL 5.7. We use Laravel Forge to manage our servers, and it only provides MySQL 8, MariaDB, and PostgreSQL.
After some testing, we tuned some settings based on our old database server and scaled up our testing environment to stress-test the new MySQL server. Everything seemed fine, so we were ready for another migration attempt.
This time, without notifying our customers, we planned a new migration. We knew it would take around 20 minutes, and we just wanted to make sure the new database server was up and running. We pulled a dump from the old database server and started the migration.
We have many charts in our Grafana dashboard where we can see the MySQL statistics and the import progress. Our database is around 10 GB, and we know the import takes around 20 minutes, so after 5 minutes you would expect around 2.5 GB to be imported.
After 5 minutes we saw this:
The graph told us that we had only imported 130 MB in the last 5 minutes. Wait, what…? That is way less than expected, and really confusing. We logged into the MySQL server, and everything seemed fine in there, except for the size of the database.
We didn’t want to keep guessing mid-migration, so we rolled back to the previous MySQL server again. We can say that we’re really good at rolling back databases now ;). So PingPing was back up and running on the old MySQL server. But why didn’t the database grow? Why wasn’t it importing all the data?
We stayed curious, since the database wasn’t doing what it should. Then we found the issue:
Did you know that MySQL, as of version 8.0, caches table and database sizes by default for 24 hours?
Basically, the import went just fine, but we were looking at cached statistics. Luckily there is a setting for this called `information_schema_stats_expiry`. We set it to 30 seconds and restarted the new MySQL server. Then we did a new import on the new server, and everything worked as expected.
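Assuming MySQL 8.0, the fix looks roughly like this (30 seconds is the value we used; whether you persist it via `SET PERSIST` or the config file is a matter of taste):

```sql
-- Expire cached information_schema statistics after 30 seconds
-- instead of the 24-hour default (86400 seconds).
-- SET PERSIST also survives a server restart.
SET PERSIST information_schema_stats_expiry = 30;

-- With fresh statistics, the reported size per schema now tracks
-- the real import progress instead of a day-old snapshot.
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.tables
GROUP BY table_schema;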
We wanted to be sure, so we did the import a few more times. The new MySQL server seemed to be up and running and ready to be used. Time for attempt three.
As you can imagine, we were now confident enough to do a new migration. The migration took around 15 minutes, and then all PingPing traffic was going through the new server.
We’re very impressed by the performance of this piece of hardware. We’re confident this MySQL server will serve PingPing for a long time and scale with us as PingPing grows.
Since the database performs so much better, we also need fewer workers to perform the same number of uptime checks. We still pay a similar price for the server as before, but with much better performance. We can thank the dedicated Hetzner server running MySQL for that.
Migrations are hard. There are so many steps and details involved, and something can quickly go wrong. Luckily we had a plan of approach and a checklist, so we were ready to roll back if something went wrong.