← Back to blog

Logstash and Elasticsearch gotchas that quietly destroy performance

Architectural issues in Logstash and Elasticsearch that teams often miss until ingestion lag, heap pressure, or cluster instability shows up.

Nov 8, 20259 min read

When Logstash and Elasticsearch are slow, teams often reach for the wrong knob first. They increase JVM heap, add workers, or scale a data node vertically. Those actions sometimes buy time, but they rarely solve the architectural reason the pipeline is struggling.

In real systems, performance loss is usually caused by a small number of overlooked decisions:

  • too much work done in Logstash filters
  • index design that inflates segment and shard overhead
  • mappings that create unnecessary field explosion
  • ingestion patterns that force Elasticsearch into constant merge and refresh pressure
  • backpressure that is invisible until queues start dropping or latency spikes

This is the set of gotchas I see most often.

1. Logstash is not free compute #

Teams often treat Logstash as a place to do arbitrary enrichment because it is already in the data path. That becomes expensive fast.

A pipeline that does grok, multiple mutate passes, geo lookups, condition-heavy branching, and Ruby filters may still work at low throughput. At higher volume, it becomes CPU-bound long before Elasticsearch is the bottleneck.

The practical mistake is assuming that all transformations have equal cost. They do not.

  • dissect is usually cheaper than grok when the log format is stable
  • repeated regex work is expensive and scales poorly
  • Ruby filters are flexible but easy to misuse
  • enrichments that call external systems belong outside the hot ingestion path whenever possible

If the pipeline is business-critical, treat Logstash filters as production compute, not glue code.

What to do instead #

Use a cost hierarchy:

  1. normalize at the source when possible
  2. use deterministic parsers before regex
  3. keep enrichment shallow in the ingest path
  4. move heavy lookups or correlation into asynchronous post-processing

If you cannot explain which filters are the most expensive, you are probably overusing Logstash.

2. Bigger heap is often the wrong fix #

This is one of the most persistent Elasticsearch myths: if the cluster is slow, give it more heap.

Heap helps only when the workload and memory profile justify it. Otherwise, larger heap can increase GC pause time and delay recovery during pressure.

The more important question is why the node needs the extra heap in the first place.

Typical reasons:

  • too many shards
  • large field mappings
  • aggressive aggregations on high-cardinality fields
  • fielddata enabled where doc values would have been safer
  • query patterns that keep large working sets hot

In other words, heap pressure is often a symptom of index and query shape, not just a memory shortage.

3. Too many shards is still the most common architectural mistake #

Shards are not just containers for data. They are units of coordination, recovery, merging, and cluster-state complexity.

A common anti-pattern looks like this:

  • daily indices
  • multiple primary shards per daily index
  • low actual data volume per index
  • months of retention

That combination creates a huge number of small shards. The cluster spends effort managing structure instead of serving useful work.

Small shards are especially dangerous because they feel harmless early. The cluster works, dashboards load, and index creation is fast enough. Then retention grows, field counts grow, and suddenly master nodes spend too much time on cluster state while queries fan out across many tiny shards.

Better rule #

Choose shard sizing based on projected data volume and query behavior, not habit. Daily indices are not automatically correct. Sometimes weekly or monthly rollover is the better choice, especially when data volume is moderate.

Sharding should be capacity planning, not copy-paste configuration.

4. Dynamic mappings can become a hidden outage multiplier #

Dynamic mapping is convenient in early development. In production ingestion systems, it is dangerous when left unconstrained.

If logs or events contain uncontrolled keys, Elasticsearch can create a large and unstable mapping surface:

  • tenant-defined fields
  • nested JSON payloads
  • metadata with arbitrary key names
  • labels or tags with unbounded variety

This leads to mapping explosion, larger cluster state, higher heap usage, and slower indexing.

Even worse, the problem is often introduced by one noisy source while the rest of the cluster pays the cost.

Safer approach #

  • explicitly map stable fields
  • store highly variable payloads as flattened or raw JSON where appropriate
  • cap unbounded metadata
  • reject or quarantine malformed events before they poison the index schema

If schema governance is weak, Elasticsearch performance becomes a data hygiene problem.

5. Refresh and merge behavior punishes bad ingestion patterns #

Elasticsearch is not a queue with search on top. It is a search engine that happens to accept writes. The write path has costs:

  • refresh
  • flush
  • translog maintenance
  • segment creation
  • segment merging

If you force the cluster into overly frequent small writes, you increase coordination and merge pressure.

This is common when:

  • producers send tiny bulk requests
  • refresh is left too aggressive for a write-heavy use case
  • many pipelines write to many indices at once
  • hot indices receive bursts without enough buffer in front of them

The result is often misread as “indexing is slow”, when the real issue is write pattern inefficiency.

Practical guideline #

Use bulk requests that are sized by testing, not guesswork. Too small wastes overhead. Too large increases retry pain and memory spikes. The right size depends on document shape, node memory, and concurrent pipelines.

Also separate ingestion concerns from search concerns. If an index is primarily ingestion-heavy, optimize its refresh and lifecycle behavior for that reality.

6. Backpressure is usually present before anyone names it #

In a healthy ingestion system, you should be able to answer:

  • how far behind Logstash is
  • whether persistent queues are growing
  • whether bulk rejections are increasing
  • whether Elasticsearch thread pools are saturated
  • whether ingest latency is source-side or storage-side

Many teams cannot answer those questions clearly. They only notice trouble after users complain about missing logs or dashboards lagging by minutes.

That is an observability failure, not just a throughput problem.

Backpressure across Logstash and Elasticsearch should be treated as an architectural signal:

  • persistent queue growth means downstream cannot keep up
  • bulk retry growth means write coordination is unhealthy
  • indexing latency with low CPU may suggest merge or storage pressure
  • rising ingestion delay with normal node metrics can indicate pipeline inefficiency upstream

If you only watch host-level CPU and memory, you will miss the real failure mode.

7. Hot-warm-cold architecture fails when data access patterns are wrong #

Tiered storage is attractive, but teams often move data by age alone instead of by actual access pattern.

That causes two common problems:

  1. important dashboards hit colder tiers too early and become slow
  2. expensive hot storage is wasted on data that no longer needs low-latency search

Lifecycle policy should reflect:

  • query recency
  • operational use cases
  • incident response windows
  • compliance retention needs

A good ILM policy is a product decision as much as an infrastructure decision.

8. Multi-tenant indexing can become a noisy neighbor problem #

One large source can easily dominate ingestion, mapping growth, and query cost for everyone else when tenants share the wrong index strategy.

Common anti-patterns:

  • all tenants in one index with no clear field discipline
  • one template used for very different event shapes
  • tenant-specific metadata stored as first-class mapped fields

You do not always need a separate index per tenant, but you do need a deliberate isolation strategy.

That may mean:

  • separate indices for large or noisy tenants
  • routing strategies for predictable locality
  • stronger template boundaries
  • dedicated lifecycle rules for different traffic classes

The architectural question is not “single index or many indices”. The real question is how much blast radius you are willing to accept.

9. Query design can sabotage ingestion performance #

Clusters are often judged by search latency, but heavy queries also affect indexing health.

Examples:

  • large aggregations on high-cardinality fields
  • wildcard-heavy queries on analyzed text
  • dashboards that fan out across broad time ranges every few seconds
  • queries that force many shard hits for small user value

If the same cluster handles heavy ingestion and heavy analytics, query discipline matters. Otherwise, search traffic steals resources from the write path and the cluster appears unstable in both directions.

This is why capacity planning for Elastic should be workload-based, not node-count-based.

10. Recovery behavior matters more than peak benchmarks #

Many teams test throughput only when the cluster is healthy. That hides the architecture risk.

The better question is:

How does the system behave during:

  • node loss
  • index rollover
  • rebalance
  • delayed storage
  • bulk rejection storms

If a pipeline works only in the happy path, it is not really production-ready.

A resilient Elastic architecture should preserve:

  • bounded ingestion lag
  • understandable failure domains
  • recoverable queue depth
  • predictable shard relocation behavior

The quality of the design shows up during stress, not during a benchmark demo.

What I would review first in a struggling stack #

When an Elastic pipeline is slow, I would review these in order:

  1. shard count and average shard size
  2. mapping growth and dynamic field behavior
  3. Logstash filter cost, especially regex-heavy work
  4. bulk request shape and rejection rates
  5. refresh and lifecycle settings for hot indices
  6. queue growth and ingestion delay visibility
  7. query patterns competing with ingestion

That sequence usually reveals the real problem faster than tuning JVM heap or adding random workers.

Closing view #

Logstash and Elasticsearch performance problems are rarely caused by one dramatic bug. They come from quiet architectural decisions that compound over time.

The overlooked gotchas are usually structural:

  • schema discipline
  • shard strategy
  • ingestion pattern
  • query isolation
  • backpressure visibility

If those are handled well, the stack performs predictably. If they are ignored, every “quick fix” turns into another layer of operational debt.