---
title: "Resolving a year-long ClickHouse® lock contention"
excerpt: "Here's how we fixed a problem that had stumped us for over a year and saw a 5x performance improvement in the process."
authors: "Jordi Villar"
categories: "Engineering Excellence"
createdOn: "2023-10-27 00:00:00"
publishedOn: "2023-10-30 00:00:00"
updatedOn: "2025-04-24 00:00:00"
status: "published"
---

<p>This is a story about completely saturating a ClickHouse® replica (and being happy about it). We were able to fix a long-running issue that limited our query concurrency, and we increased CPU utilization to 100% in the process.</p><h2 id="the-unknown-limit">The unknown limit</h2><p>The story begins about a year ago. One of our ClickHouse® clusters started to malfunction, causing some queries to slow down significantly. At first glance, everything seemed fine. We checked the usual bottlenecks - CPU, memory, I/O - and all checked out. But still, at a certain number of requests per second, query responses on the replica would slow down. We couldn't figure out why. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tinybird-blog.ghost.io/content/images/2023/10/cpu_average--1-.jpg" class="kg-image" alt="A Grafana dashboard showing average CPU utilization for a ClickHouse® cluster over the span of 12 hours. Average utilization was well below 20%." loading="lazy" width="773" height="346" srcset="https://tinybird-blog.ghost.io/content/images/size/w600/2023/10/cpu_average--1-.jpg 600w, https://tinybird-blog.ghost.io/content/images/2023/10/cpu_average--1-.jpg 773w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The CPU was just chillin' at &lt;20%.</span></figcaption></figure><p>We spent days digging into different metrics, checking query performance, assessing required resources, tweaking cloud provider settings, and many other things. We had many more hypotheses than we had real solutions. Maybe we could find and resolve some hidden bandwidth limits or improve concurrency by limiting the number of threads used per query? Everything was pointing to some kind of contention, but we couldn't see it. <em>"We must be missing something"</em> was a common refrain in those days.</p><p>As engineers, we would have loved to dive deep into this issue, understand what was happening, and develop some novel fix. 
But we have to balance our itch to solve problems with some pragmatism; we still have a <a href="https://www.tinybird.co/product" rel="noreferrer">real-time data platform</a> to build, customers to support, and features to release. So we made a quick fix, slightly changing the cluster setup and isolating some workloads so we could move on to the next GitLab issue.</p><h3 id="some-context-about-tinybird-and-clickhouse">Some context about Tinybird and ClickHouse®</h3><p><a href="https://www.tinybird.co" rel="noreferrer">Tinybird</a> is a real-time platform for data and engineering teams. We give our customers the power to unify data from many source systems (including databases, data warehouses, and streaming data), develop <a href="https://www.tinybird.co/blog-posts/real-time-analytics-a-definitive-guide" rel="noreferrer">real-time analytics</a> with SQL, and publish their queries as low-latency, scalable APIs.&nbsp;</p><p>We are focused primarily on streaming ingestion and real-time use cases; our clients can ingest millions of events and make thousands of concurrent API requests per second. Under the hood, we use <a href="https://github.com/clickhouse/clickhouse">ClickHouse®</a> as our primary analytical database and real-time SQL "engine". ClickHouse® is an incredibly powerful open-source database, and we only use a subset of its features. Still, with the scale we support and our customers' latency and throughput requirements, we often push it to the limit.&nbsp;</p>
<!--kg-card-begin: html-->
<blockquote>ClickHouse® is an incredibly powerful open-source database, and we only use a subset of its features. Still, with the scale we support and our customers' latency and throughput requirements, we often push it to the limit. </blockquote>
<!--kg-card-end: html-->
<p>As such, we frequently contribute to the ClickHouse® project by developing performance improvements that help us continue to support our customers' scale.</p><h2 id="finding-the-root-cause">Finding the root cause</h2><p>Despite our quick fix, this resource contention became a recurring issue. We had been able to dodge it by deploying more quick fixes, but that changed a few weeks ago. To support a particular use case for one of our clients, we were approaching a scale that made the problem impossible to ignore. It was time to invest the time and resources to fix it for good.</p>
<!--kg-card-begin: html-->
<blockquote>We had been able to avoid fixing this problem for almost a year, but certain customers began approaching a scale that made it impossible to ignore.</blockquote>
<!--kg-card-end: html-->
<p>Every time we had faced this issue before and explored various metrics and profile events, we ended up at the same dead end. We saw lock-related profile events such as <code>RWLock*</code> and discarded them, since we knew they were not the root cause but merely a symptom.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tinybird-blog.ghost.io/content/images/2023/10/context_lock_events.jpg" class="kg-image" alt="A Grafana dashboard showing lock-related profile events. A small spike of ContextLockWait events is evident." loading="lazy" width="1293" height="770" srcset="https://tinybird-blog.ghost.io/content/images/size/w600/2023/10/context_lock_events.jpg 600w, https://tinybird-blog.ghost.io/content/images/size/w1000/2023/10/context_lock_events.jpg 1000w, https://tinybird-blog.ghost.io/content/images/2023/10/context_lock_events.jpg 1293w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">When examining lock-related profile events, we noticed a small spike in </span><code spellcheck="false" style="white-space: pre-wrap;"><span>ContextLockWait</span></code><span style="white-space: pre-wrap;"> events</span></figcaption></figure><p>This time was different. We noticed a small spike of <code>ContextLockWait</code> events. After all this time surveying typical server metrics for clues and coming up empty, we finally had something promising!</p><p>Unfortunately, we didn't have a good way to measure its impact on performance. Unlike <code>RWLock*</code> events, which have <code>*WaitMilliseconds</code> counterparts that indicate how long threads waited to acquire the lock, <code>ContextLockWait</code> events don't come with a time-based equivalent that would have let us measure the time spent waiting for those locks. So we were flying blind, with no way to understand the consequences of these events.</p>
<!--kg-card-begin: html-->
<blockquote>In ClickHouse®, <code>ContextLockWait</code> events didn't come with an equivalent time-based wait property, like <code>RWLockWaitMilliseconds</code>, that would have allowed us to measure the performance impact of these events.</blockquote>
<!--kg-card-end: html-->
<p>We decided to dump all threads during the incident and see if we could extract something from there. After a first look at the dump, it was clear that <code>ContextLock</code> was, at least, one of the issues.</p><p>So we built a small reproducer to report the issue to the ClickHouse® repository and help the contributors with the fix. While working on the reproducer, we sent the first improvement upstream, <a href="https://github.com/ClickHouse/ClickHouse/pull/55029">adding a profile event to report Context Lock waiting time</a>. With this metric available, it became easy to find a query that, given sufficient concurrency, would cause the same contention we were seeing in our clusters. This moved our first roadblock out of the way.</p>
<!--kg-card-begin: html-->
<blockquote>We built a small reproducer and pushed a change to add a waiting time profile to <code>ContextLock</code> events in ClickHouse®.</blockquote>
<!--kg-card-end: html-->
<p>With the reproducer in place and a way to measure the impact of the contention, it was just a matter of digging into the code and figuring out how to reduce contention and improve performance. After quite a big refactor spearheaded by my colleague <a href="https://maksimkita.com/" rel="noreferrer">Maksim Kita</a>, we <a href="https://github.com/ClickHouse/ClickHouse/pull/55121">managed to remove lock contention</a>, reducing the impact on our synthetic example by an order of magnitude. </p><p>Here's how we did it:</p><h2 id="context-refactoring">Context refactoring</h2><p>Here's the basic architecture of Context in ClickHouse® before the refactor:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tinybird-blog.ghost.io/content/images/2023/10/contextsharedpart.png" class="kg-image" alt="A basic architecture diagram of ClickHouse® Contexts, showing how ContextSharedPart and Context share a single global mutex." loading="lazy" width="2000" height="713" srcset="https://tinybird-blog.ghost.io/content/images/size/w600/2023/10/contextsharedpart.png 600w, https://tinybird-blog.ghost.io/content/images/size/w1000/2023/10/contextsharedpart.png 1000w, https://tinybird-blog.ghost.io/content/images/size/w1600/2023/10/contextsharedpart.png 1600w, https://tinybird-blog.ghost.io/content/images/2023/10/contextsharedpart.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Before we addressed this issue, </span><code spellcheck="false" style="white-space: pre-wrap;"><span>ContextSharedPart</span></code><span style="white-space: pre-wrap;"> and </span><code spellcheck="false" style="white-space: pre-wrap;"><span>Context</span></code><span style="white-space: pre-wrap;"> instances all shared a single global mutex. 
</span></figcaption></figure><p>In ClickHouse®, <code>ContextSharedPart</code> is responsible for storing global objects shared between all sessions and queries, for example:</p><ul><li>Thread pools</li><li>Server paths</li><li>Global trackers</li><li>Cluster information</li></ul><p><code>ContextSharedPart</code> also provides many useful methods for working with these objects, with proper synchronization.</p><p><code>Context</code> is responsible for storing query- or session-specific objects, for example:</p><ul><li>Per-query settings</li><li>Per-query caches</li><li>Current database</li></ul><p><code>Context</code> also provides many methods for working with these objects, and it uses <code>ContextSharedPart</code> to provide some of that functionality.</p><p>During query execution, ClickHouse® can create a lot of <code>Contexts</code>, because each subquery can have unique settings. For example:</p>
<!--kg-card-begin: html-->
<iframe width="100%" src="https://snippets.tinybird.co/XQAAAALhAAAAAAAAAABBKUqGk9nLKwj8-A0SOPXuMH2nq2Z3bj1VSsxdAidEVP80NyBX89BVIGiIkIandNEWBL-W2WGBi5iDEGwlM4gBF0QgUmW70GabtyB7d_SsWjZF0UE3MI9_2803FSxCfnJ5ViJvG5timi3-rsOb5l4HI3MBrOowsKw6qhM-lHMiUsQDntc5DdW9GRbqSzezY0eavvRkXyVlOk1rE6WIdra6ydrntzI8wgu44M_NFcoiH_92OgAA/embed"></iframe>
<!--kg-card-end: html-->
<p>In this example, a nested subquery will have <code>max_threads = 16</code>, requiring its own <code>Context</code>.</p><p>The problem was that a single mutex was used for most of the synchronization between <code>Context</code> and <code>ContextSharedPart</code>, even when working with objects local to a <code>Context</code>. A large number of low-latency, concurrent queries, each with many subqueries, creates a lot of <code>Contexts</code> per query, and the problem becomes even bigger.</p><p>We did a big refactoring, replacing the single global mutex with <a href="https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock">read-write mutexes</a>: one global read-write mutex for <code>ContextSharedPart</code> and one local read-write mutex for each <code>Context</code>. We used read-write mutexes because most accesses are concurrent reads (for example, reading settings or a path), while concurrent writes are rare.</p><p>In many places, we got rid of synchronization entirely where it was only used for initialization, using <a href="https://en.cppreference.com/w/cpp/thread/call_once"><code>call_once</code></a> for objects that are initialized only once.</p><h3 id="the-context-architecture-after-our-refactor">The Context architecture after our refactor</h3><p>Here's the way things looked after our refactor:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tinybird-blog.ghost.io/content/images/2023/10/contextsharedpartrefactor.png" class="kg-image" alt="A diagram showing refactored ClickHouse® contexts, with each Context getting its own individual read-write mutex." 
loading="lazy" width="2000" height="702" srcset="https://tinybird-blog.ghost.io/content/images/size/w600/2023/10/contextsharedpartrefactor.png 600w, https://tinybird-blog.ghost.io/content/images/size/w1000/2023/10/contextsharedpartrefactor.png 1000w, https://tinybird-blog.ghost.io/content/images/size/w1600/2023/10/contextsharedpartrefactor.png 1600w, https://tinybird-blog.ghost.io/content/images/2023/10/contextsharedpartrefactor.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">After refactoring, each </span><code spellcheck="false" style="white-space: pre-wrap;"><span>Context</span></code><span style="white-space: pre-wrap;"> had its own read-write mutex.</span></figcaption></figure><p>We also <a href="https://github.com/ClickHouse/ClickHouse/pull/55278">added</a> clang <a href="https://clang.llvm.org/docs/ThreadSafetyAnalysis.html">ThreadSafetyAnalysis (TSA) annotations</a>, so we can be sure that our refactoring does not introduce race conditions or deadlocks.</p><h2 id="bringing-the-cpu-to-100">Bringing the CPU to 100%</h2><p>Once both changes were accepted and merged to ClickHouse® master, we decided to run some tests in a production-like environment. The main objective was to see if we could reproduce the issues and assess the impact of the fix by simulating our clients' workload on two different ClickHouse® versions:</p><ul><li><strong>23.9</strong>: Contains the metric to measure the impact</li><li><strong>master</strong>: Contains the <code>ContextLock</code> fix</li></ul><p>So, we recreated the databases and tables, extracted 1,000 example queries from the last seven days, and ran them continuously for a few minutes using <code>clickhouse-benchmark</code> with enough concurrency:</p>
<!--kg-card-begin: html-->
<iframe width="100%" src="https://snippets.tinybird.co/XQAAAALrAAAAAAAAAABBKUqGk9nLKv3YEWqfeB6D0mLTd2_vDQ8UPl-Z1BGvc3a-fq_HgcrI9cWhANeNMKkMwsR8MGLB7tYiLGPsRT4ENRZ5YNI-IUMKmHrVixfiUXhtbk1m0T6dJ2bhFxZgDnDp8ENXlDZvm8HXGvQ44Elhha78YFuJWlLhj5XdEhL7u0A4xjKJprCRuidukoHnvscOT9eZ7wt89d3uGPMpF4PKUgPLvJrVgkkZjjOGqXm1kQoRNv__73WIAA/embed"></iframe>
<!--kg-card-end: html-->
<h3 id="results-with-239">Results with 23.9</h3><p>Here are the results we saw on 23.9, where we implemented a way to monitor the impact of the Context lock contention:</p>
<!--kg-card-begin: html-->
<iframe width="100%" src="https://snippets.tinybird.co/XQAAAALpAwAAAAAAAABBKUqGk9nLKvXbHOiCiV-nuqyknXylY-DbYgQh0N0RxhK2vLLbM_W8XCZJZ0gZzb4_jFMnmeonNa4TDSbJ2Xbq4X_yFXO52N4_uZbHdjXsFQYzSyTV1f0uw06tmCKI89EEn7O7XMdyhdTCNDq9MqI1LDc0LF9UIF3dsLd3SkqXbQCH19Agz_zngPW2BrriCShC3TUQ-mfAhmg6c-_JfxwK-Vp1rRXoldhOAQehIDuZ_OrZhT0SE8VDyVxrRwhL1JexWjFt-PPY5oLvYxmR95bxnuyDoGAG7BcVQgtWuk5zJAGTNmk5N6z0zUygVJOtgrw-B6eyGg8oleoT9Mak5axBHjta6GgvTJzzeItaBQMltinIQFh4lxy6XzQca8cTDBfmv8dw0zOR69jpYsLnBx7y5UV4J4T21YPeOKFug0UKbiiPrOod0yw5cP6MU2W8IqBkCdbdLaF6p3ASQTVhYjRsfBXWsd890bI-sieDLiX8M9X498fKY3wYFq68loLaSqntmIIWOccVy-nJHt9NVK7PlcJ3xvjUY4EB5fRdVIj4sPAPaaol22t2utUp8ZGakOL3N4KJGY2G_-lk0AA/embed"></iframe>
<!--kg-card-end: html-->
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tinybird-blog.ghost.io/content/images/2023/10/23_9--1-.jpg" class="kg-image" alt="Two Grafana dashboards showing CPU utilization and load on a ClickHouse® cluster" loading="lazy" width="1064" height="340" srcset="https://tinybird-blog.ghost.io/content/images/size/w600/2023/10/23_9--1-.jpg 600w, https://tinybird-blog.ghost.io/content/images/size/w1000/2023/10/23_9--1-.jpg 1000w, https://tinybird-blog.ghost.io/content/images/2023/10/23_9--1-.jpg 1064w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Baseline results on 23.9</span></figcaption></figure><p>As a baseline on 23.9, we achieved:</p><ul><li>~200 QPS</li><li>CPU utilization of only ~20%</li><li>Half of the queries took at least 1s</li><li>The slowest queries took ~75s</li></ul><h3 id="results-with-master">Results with master</h3><p>Here are the results we saw on master, where we had refactored Contexts to attempt to resolve our lock contention:</p>
<!--kg-card-begin: html-->
<iframe width="100%" src="https://snippets.tinybird.co/XQAAAAKkAwAAAAAAAABBKUqGk9nLKvkHJ4hwB0hpEGN6Xd60jjt5jn0BT6ibImCCZUStfUFcs5-T4czmQctea3toUM4ZBSkMh5vUYbXoSu4Cl7PlTMWvMcuZ_QftPu5UHhLixuvbHAcFpuHcSF1uJEU6jtseV7OYPjuNi8gDW3OCXJM49264rmG0D8lUyM7Ef16gAhtxUO2hWzUiBKKjJq03AuIigVY8WQ5eMx8IX4cr_LWdOxI4xyRdKzOkTbtragdurhi90Gc7neMOH37vYCXtLRmE4exUGwUp32l5132b9LakhdoamAVR249klpdeeIA6Hdsxb9nQI8fS4zzqRH-tBQmvQfTuvdDYxbGUFwtw_4qoDFdBUJCNom-52RKY-BJWb-zj4KhK0VM5JKrS_Mt1zN6awTdRcc0OL5g8b6gnfypDmIrYLXdYNisTbUCmw2hjZMYhjYaJ4yW-AebdGo_gDjFMJG5Cz7xClZR3TpMH5q8ehsrdGOKvjy1C4RTU9gqCS2Wx91hzKfFVgSXjgKwooa7_or-5q76slHT7Vr-pKcBOYEh4lPFb7mAWjfQhicVzKuzA_7JMtuA/embed"></iframe>
<!--kg-card-end: html-->
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tinybird-blog.ghost.io/content/images/2023/10/master.jpg" class="kg-image" alt="Two Grafana dashboards showing performance improvements after Context refactoring in ClickHouse®" loading="lazy" width="1349" height="340" srcset="https://tinybird-blog.ghost.io/content/images/size/w600/2023/10/master.jpg 600w, https://tinybird-blog.ghost.io/content/images/size/w1000/2023/10/master.jpg 1000w, https://tinybird-blog.ghost.io/content/images/2023/10/master.jpg 1349w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">With the fixes and at </span><code spellcheck="false" style="white-space: pre-wrap;"><span>concurrency=500</span></code><span style="white-space: pre-wrap;">, we were able to 3x performance.</span></figcaption></figure><p>With the refactor, we achieved:</p><ul><li>~600 QPS (<strong>~3x better</strong>)</li><li>CPU utilization of ~60% (<strong>~3x better</strong>)</li><li>Median query time of ~0.6s (<strong>~2x better</strong>)</li><li>Slowest queries at ~6s (<strong>~12x better</strong>)</li></ul><p>Since tests with master were successful, we added more concurrency to <code>clickhouse-benchmark</code> (1000 instead of 500) to see what would happen. Here are the results:</p>
<!--kg-card-begin: html-->
<iframe width="100%" src="https://snippets.tinybird.co/XQAAAAKQAwAAAAAAAABBKUqGk9nLKzhRtWu_ioCjCJ-49rO8O0VIDxNyAlcAeC91CDX1zAtM3oChxQ3u9b9tOp0it3CILUZA1I_TotMXUXVUAiQAA9nNAf8uXgi-bYihp_RxaCvCgmTWIq_U1b8tg5wDUC_KlzhR3tZ0o3dmkFbHFW_y8_WXs-laBuR_kuiIdV_-SsOZAgY2XkMkYvEqlkkvBCsk7HUBIhQhbe581y77jiuaajzYO2JpKMiHoehTjvDaeGcN5ruEbFzNj1No73lYJYtxPV8l9YuWJ2UecDBL8tKBoe-ScC1k6K7e0EoCcWMjAaCPJmW8sMwrEX7GhELJauzZf_iEBy5Qc1-LJp8omtPt04zXmE5nc4483DXaVs0gz2H4Q76Bcj7UAzYJus8s63N3bj5Hxu81CrddV_JeeAvctrmUwF42-ZREks0NKc0XEfSBUMl-o6sKGRub_xhUndk0oN73QajPevtHXk9VPEuftpzERqmBBt6uRh1NlyoIaJfgvS_AGYC_zrra4IEb4HmsQ_Dg4-aRGIzffcyk4IR0jUn1oaMY6sPYTu6Xs7qoDYrN5xwM5O1Qiv_pd4IA/embed"></iframe>
<!--kg-card-end: html-->
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tinybird-blog.ghost.io/content/images/2023/10/master_1000.jpeg" class="kg-image" alt="Two Grafana dashboards showing a big performance improvement on ClickHouse®." loading="lazy" width="1088" height="340" srcset="https://tinybird-blog.ghost.io/content/images/size/w600/2023/10/master_1000.jpeg 600w, https://tinybird-blog.ghost.io/content/images/size/w1000/2023/10/master_1000.jpeg 1000w, https://tinybird-blog.ghost.io/content/images/2023/10/master_1000.jpeg 1088w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Boom. ~100% CPU utilization and at least 5x performance boost!</span></figcaption></figure><p>These results are incredible:</p><ul><li>~1,000 QPS (<strong>~5x better</strong>)</li><li>~100% CPU utilization (<strong>~5x better</strong>)</li><li>0.033s median response time (<strong>~30x better</strong>)</li><li>Slowest queries at ~4s (<strong>~20x better</strong>)</li></ul><p>Of course, these are just tests, and we don't expect to get a 5x improvement in production, since we will certainly find other bottlenecks in I/O, CPU, and/or memory. But we have at least removed a contention that had stumped us for a long time. Even if all we got was a 1.5x bump in performance, this is a <em>massive</em> improvement for us, our infrastructure, and the customers who use it.</p>
<!--kg-card-begin: html-->
<blockquote>We don't expect to get a 5x boost in production, but even if it's just 1.5x, that's still a massive improvement for us and our customers.</blockquote>
<!--kg-card-end: html-->
<p>With the ClickHouse® 23.10 official release (<a href="https://clickhouse.com/company/events/v23-10-community-release-call" rel="noreferrer">slated for November 2nd</a>), we will update some of our most performance-demanding clients and see the real impact of these changes.</p><p>We expect good things.</p>
