The Carbon Black Challenge: when the killer wasn’t in the database at all.

A high-stakes Oracle RAC relocation kept failing its dry runs — batch jobs running at half speed despite better hardware. The culprit wasn’t the queries, the network, or the cluster. It was a piece of endpoint-security software nobody had flagged, intercepting every inter-RAC packet. Here’s what went wrong, how we found it, and the instrumentation lesson that has stayed with me.


01 · The Setup

A relocation that should have been a win

The project was a major data center relocation: moving a substantial Oracle RAC cluster database to a new, better-equipped facility. The new site had superior hardware, more headroom, and a cleaner network topology. On paper, this was a performance upgrade, not a risk.

Plenty of preparation went in. Weekly cutover dry runs over several months. Detailed rehearsals. A long checklist of validations. The kind of project where, by month four, you expect the dry runs to be boring.

They weren’t. Something wasn’t right, and it kept being not-right week after week.

  • ~50% batch performance on new hardware

  • Months of weekly dry runs

  • 0 hits on the obvious database suspects

  • 1 hidden agent doing the damage

02 · The Problem

The dry runs that wouldn’t cooperate

Critical batch processing on the new infrastructure was consistently running at roughly half the throughput we were seeing in the source data center. Same code. Same data. Better hardware. Worse numbers.

The kicker: this wasn’t a transient blip. It reproduced every week, in every dry run, like clockwork. And every week the project clock ticked closer to a cutover date that we increasingly couldn’t justify hitting.

When the destination is supposed to be faster than the origin, and it’s reproducibly half the speed, you start questioning everything.

03 · Dead Ends

The troubleshooting trail that led nowhere

Standard Oracle performance troubleshooting found the symptoms but never reached the cause. The wait events were real. They just weren’t the disease.

The Top SQL analysis surfaced the usual fingerprints of a stressed RAC cluster:

  • gc cluster waits
  • enq: KI - contention
  • enq: TX - allocate ITL entry
  • buffer busy waits

All of these are real wait events with real costs. None of them were the actual problem — they were what RAC looks like when something underneath it is broken.
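
For anyone who wants to reproduce that view, here is a minimal sketch of pulling the top non-idle wait events across RAC instances with python-oracledb. The credentials and DSN are placeholders, and the FETCH FIRST syntax assumes Oracle 12c or later.

```python
# Minimal sketch: the cluster-wide "top waits" view the team was
# pattern-matching against. Credentials and DSN are placeholders.
import oracledb

conn = oracledb.connect(user="perfmon", password="change_me",
                        dsn="rac-scan:1521/orclsvc")

TOP_WAITS = """
    SELECT inst_id, event, total_waits,
           ROUND(time_waited_micro / 1e6, 1) AS seconds_waited
      FROM gv$system_event
     WHERE wait_class <> 'Idle'
     ORDER BY time_waited_micro DESC
     FETCH FIRST 10 ROWS ONLY
"""

with conn.cursor() as cur:
    for inst, event, waits, secs in cur.execute(TOP_WAITS):
        print(f"inst {inst}: {event:<40} {waits:>12,} waits  {secs:>10,.1f}s")
```

A query like this tells you what the cluster is waiting on. It cannot tell you what, underneath the database, is causing the waiting.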

Oracle Support went down the path you’d expect.

Patch upgrade chasing

Several patches were proposed because the symptoms resembled known issues. They didn’t resolve the problem — they treated the wait events without touching whatever was causing them.

Inter-RAC process tuning

Increasing the number of inter-node communication processes was suggested. A reasonable tactic for cluster waits — but again, hitting the surface, not the cause.

Top SQL re-tuning

Tuning the “heavy” SQL was a natural reflex. But the SQL wasn’t heavy because of the SQL. It was heavy because the cluster couldn’t talk to itself fast enough.

The instrumentation gap

The deeper problem: we couldn’t easily see OS-level CPU and network telemetry tied to specific processes. The metrics that would have unmasked the real culprit simply weren’t in the standard toolbox.

This is the part of the story that’s worth slowing down on. The team wasn’t doing anything wrong. The methodology was sound. The trouble is that when your instrumentation can’t reach the layer the problem lives in, you spend weeks pattern-matching on the layer above it.
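
For concreteness, here is roughly what the missing instrument looks like: a per-process CPU sampler, sketched with psutil. One honest caveat is baked in: stock Linux does not attribute network bytes to individual processes, so the NIC counters below are system-wide, and per-process network accounting needs eBPF or nethogs-style tooling. That gap is exactly why this layer goes dark.

```python
# Sketch of the sampler that was missing from the standard toolbox:
# per-process CPU alongside NIC counters. Caveat: Linux does not attribute
# network bytes to processes out of the box, so the NIC numbers here are
# system-wide; per-process bytes need eBPF/nethogs-style tooling.
import time
import psutil

def sample(interval=5.0, top_n=10):
    # The first cpu_percent() call always returns 0.0, so prime the counters.
    for p in psutil.process_iter():
        try:
            p.cpu_percent(None)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass

    nic0 = psutil.net_io_counters()
    time.sleep(interval)
    nic1 = psutil.net_io_counters()

    procs = []
    for p in psutil.process_iter(attrs=["pid", "name"]):
        try:
            procs.append((p.cpu_percent(None), p.info["pid"], p.info["name"]))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass

    print(f"NICs: rx {(nic1.bytes_recv - nic0.bytes_recv) / 1e6:.1f} MB, "
          f"tx {(nic1.bytes_sent - nic0.bytes_sent) / 1e6:.1f} MB in {interval:.0f}s")
    for cpu, pid, name in sorted(procs, reverse=True)[:top_n]:
        print(f"{cpu:6.1f}%  pid {pid:<7} {name}")

if __name__ == "__main__":
    sample()
```

Run side by side in both data centers during a dry run, something this small would have put the agent’s CPU appetite in the same list as the Oracle processes.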

04 · The Discovery

The hidden culprit: an endpoint agent

The breakthrough came when somebody finally looked sideways — not at the database, not at the SQL, but at what else was running on the operating system.

Carbon Black. Endpoint security software, installed silently as part of the standard OS build for the new facility, running quietly in the background and never mentioned in any database component inventory.

The mechanism made immediate sense once we saw it. Carbon Black monitors all system activity, including network traffic, to detect security threats. That’s its job. On a general-purpose server that’s a reasonable trade-off. On an Oracle RAC cluster — where every transaction can generate a high-rate stream of small, latency-sensitive inter-node messages — it was a different story.

Every inter-RAC network packet was being intercepted and inspected. The CPU cost of that inspection, multiplied by the packet rate of a busy RAC cluster, was enough to materially slow the cluster interconnect. Higher inter-node latency directly translates into longer waits on gc events and the ITL / buffer-busy cascade we’d been chasing for months.
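
The arithmetic is worth making concrete. Below is a toy model, with every number invented for illustration rather than taken from the incident, of how a few microseconds per packet turn into both stolen CPU and queueing delay on the interconnect.

```python
# Toy model of per-packet inspection overhead. Every number is an invented
# illustration, not a measurement from the incident.
pkts_per_sec = 200_000   # assumed interconnect packet rate under batch load
inspect_us   = 4.0       # assumed CPU cost to inspect one packet (microseconds)

# Cost 1: raw CPU stolen from the host.
cores = pkts_per_sec * inspect_us / 1e6
print(f"inspection burns ~{cores:.1f} CPU cores continuously")

# Cost 2: queueing delay. Inspection serializes packets, so as offered load
# nears the inspector's capacity, waits blow up (M/M/1: Wq = S * u / (1 - u)).
capacity = 1e6 / inspect_us        # packets/sec one inspection thread can handle
u = pkts_per_sec / capacity        # utilization of that thread
wq_us = inspect_us * u / (1 - u)   # expected queueing wait per packet
print(f"utilization {u:.0%}: +{inspect_us + wq_us:.0f}us per packet, per direction")
```

Twenty microseconds per packet sounds survivable until you remember that a gc round trip pays it in both directions, a batch job chains enormous numbers of round trips, and the queueing term grows without bound as utilization approaches one. It is exactly the shape of “better hardware, half the throughput.”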

The wait events were right. They were pointing at a real bottleneck. They just couldn’t point at the agent causing the bottleneck, because nothing in the Oracle stack knew that agent existed.

Once you see it, it’s obvious. But you can’t see what isn’t instrumented — and that’s the whole problem.

05 · The Solution

Remove the agent, recover the platform

Removing Carbon Black from the database hosts produced an immediate, dramatic shift. Batch performance didn’t just recover — it exceeded the throughput we’d been getting in the original data center. The new hardware finally got to be new hardware.

The cutover went ahead. The project landed. The team moved on. But the lesson didn’t.

06 · Reflections

What I keep coming back to

With 20/20 hindsight, two things were going wrong simultaneously — and the second one is the one that actually matters.

First, the component inventory was incomplete. Carbon Black was a known troublemaker on database hosts from past experience. It should never have ended up on these systems without an explicit conversation. A complete, accurate inventory of everything running on a database host — not just Oracle components — would have caught this on day one.
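
The day-one check is almost embarrassingly small. Here is a sketch, with a hypothetical allowlist standing in for the real build standard:

```python
# Sketch of a day-one inventory check: flag anything running on a database
# host that the build standard does not explain. The allowlist below is
# hypothetical; a real one comes from the host class's build standard.
import psutil

APPROVED_PREFIXES = (
    "oracle", "ora_", "tnslsnr", "ohasd", "crsd", "evmd",  # database + clusterware
    "systemd", "sshd", "chronyd", "rsyslogd", "kworker",   # base OS
)

def unexpected_processes():
    flagged = set()
    for p in psutil.process_iter(attrs=["name"]):
        name = (p.info["name"] or "").lower()
        if name and not name.startswith(APPROVED_PREFIXES):
            flagged.add(name)
    return sorted(flagged)

if __name__ == "__main__":
    for name in unexpected_processes():
        print("not in inventory:", name)  # the security agent would surface here
```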

Second, and more importantly, the instrumentation didn’t cover the surface area the problem lived in. Database performance tooling tells you what the database is waiting on. It doesn’t tell you what an out-of-band kernel-level agent is doing to your network packets. The metrics for the infrastructure layer the database depends on are routinely under-instrumented, badly instrumented, or accessible only with specialized tools and the experience to know where to look.

If those Carbon Black metrics — CPU consumption, packet interception rate, network latency contribution — had been exposed as first-class signals next to the database’s own wait events, my anomaly detection workflow would have pointed at this within hours, not months. Instead the team did what every competent team in this position does: pattern-match the visible symptoms against the visible tooling, and miss the invisible cause.
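
To make “first-class signals” concrete: the workflow is one anomaly test applied to database and OS series alike, and it can be as blunt as a z-score against the healthy dry-run baseline. The metric names and values below are hypothetical.

```python
# Sketch: one anomaly test over database waits and OS-agent metrics alike.
# All metric names and values are hypothetical illustrations.
from statistics import mean, stdev

def zscore(baseline, latest):
    mu, sigma = mean(baseline), stdev(baseline)
    return (latest - mu) / sigma if sigma else 0.0

# Each series: samples from healthy dry runs, then this week's value.
signals = {
    "gc cr block receive time (ms)": ([1.1, 1.0, 1.2, 1.1], 4.8),
    "non-oracle agent cpu (%)":      ([0.5, 0.6, 0.4, 0.5], 22.0),  # the missing metric
}

for name, (baseline, latest) in signals.items():
    z = zscore(baseline, latest)
    print(f"{name:<32} z={z:6.1f}  {'ANOMALY' if abs(z) > 3 else 'ok'}")
```

With the agent metric in the set, the second row fires on the first dry run. Without it, only the database row fires, and the hunt stays inside Oracle.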

That pattern shows up over and over. Database problems that aren’t database problems. Network problems hiding inside CPU problems. Security agents, antivirus, monitoring overlays, kernel modules — all silently consuming resources the database thinks it has, none of them surfaced in the place a DBA would look.

The takeaway: the cost of a missing metric isn’t the time to capture it — it’s the months you spend tuning the wrong layer. Effective performance work depends on instrumenting a broad enough surface area that hidden contributors can’t hide. Everything else is pattern-matching with a blindfold on.

Footnote

About Carbon Black

What it is

Carbon Black is a security software suite providing endpoint protection, threat intelligence, and incident response. It is designed to defend against advanced threats, malware, and non-malware attacks, and runs on a range of operating systems including Red Hat Enterprise Linux. The Carbon Black App Control Linux Agent supports RHEL 6.7–6.10, 7.3–7.9, and 8.1–8.4, and installation is typically done via a script after extracting the appropriate TGZ archive. None of this is a problem on its own — it’s a perfectly reasonable piece of security tooling. It just isn’t reasonable to deploy on a latency-sensitive RAC cluster without measuring what it costs first.

