How the Cost Based Optimizer Develops the Optimal Execution Plan for Each SQL Query

Laetitia

June 14, 2026

In a world where databases process colossal volumes of information daily, SQL query performance has become a major challenge. At the heart of this battle for efficiency, the Cost Based Optimizer (CBO) acts as the invisible conductor that unlocks the full power of a DBMS. Its role is crucial: to analyze the various possible execution strategies, estimate their cost in resources such as CPU, disk I/O, and memory, and select the most suitable execution plan for the submitted SQL query. This adaptability delivers both speed and resource savings, a requirement for companies handling immense databases in increasingly complex environments.

Emerging at the end of the 1970s, the Cost Based Optimizer profoundly changed how relational databases operate. Before its arrival, classical optimizers applied fixed rules that were often ineffective against the diversity and evolution of data. Today, the CBO relies on fine-grained, regularly refreshed statistics on indexes, data distributions, and table cardinalities. These elements allow it to model the different possible paths to execute a query and compare their costs. Missing or obsolete statistics translate directly into suboptimal, sometimes catastrophic, choices.

The emergence of cloud, distributed, and hybrid environments has imposed new challenges on the Cost Based Optimizer. Response time must be minimized despite the growing complexity of data sources and issues related to network transfer. Moreover, recent advances such as adaptive optimization now correct in real-time certain discrepancies between forecasts and execution reality, ensuring unprecedented flexibility.

Let us then dive into the fascinating world of the Cost Based Optimizer, discovering its precise mechanisms, the statistics it relies on, the algorithms it chooses, and the innovations shaping its future in advanced SQL query optimization.

The historical and theoretical foundations of the Cost Based Optimizer in relational systems

The history of the Cost Based Optimizer truly begins in 1979 within IBM laboratories, with the publication of a seminal paper entitled “Access Path Selection in a Relational Database Management System” by Patricia Selinger and her team. This paper lays the mathematical foundations for quantitatively evaluating and comparing different execution plans for SQL queries, inaugurating an approach based on estimated resource consumption rather than static rules.

Before this revolution, DBMSs mostly used rule-based optimizers (Rule Based Optimizer). These followed fixed priorities, for example systematically favoring index usage when possible, regardless of the actual data context. This rigidity harmed overall performance, especially for databases with heterogeneous sizes or in constant evolution.

The concept introduced by Selinger is therefore based on the notion of estimated cost. Each execution plan of an SQL query, i.e., a sequence of operations on the data (scans, joins, sorts…), is assigned a numerical cost that combines disk accesses (page fetches) and CPU work. The CBO thus generates a tree of alternative plans, with branches representing different access paths and join algorithms; in modern engines these include nested loop, hash join, and merge join.

The optimizer calculates the cost of each scenario using a probabilistic model fed by detailed statistics: table cardinality (number of rows), filter selectivity, and data distribution via histograms. This last point allows, for example, understanding how uniformly or irregularly a column is distributed, impacting the relevance of an index or a sorting method.
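As a rough illustration of such a cost comparison, here is a minimal sketch assuming a weighted formula of the form "page fetches plus W times rows processed", in the spirit of the 1979 paper. The weight, the page counts, the one-page-fetch-per-matching-row assumption, and all function names are hypothetical and do not come from any real engine.

```python
def plan_cost(page_fetches: float, rows_processed: float, cpu_weight: float = 0.01) -> float:
    """Toy cost: I/O (page fetches) plus weighted CPU work (rows processed)."""
    return page_fetches + cpu_weight * rows_processed


def full_scan_cost(table_pages: int, table_rows: int) -> float:
    # A sequential scan reads every page and examines every row.
    return plan_cost(table_pages, table_rows)


def index_scan_cost(index_pages_read: int, table_rows: int, selectivity: float) -> float:
    # Assumption: an unclustered index costs one page fetch per matching row.
    matching_rows = table_rows * selectivity
    return plan_cost(index_pages_read + matching_rows, matching_rows)


def cheapest_access_path(table_pages, table_rows, selectivity, index_pages_read=3):
    """Return the name of the cheaper access path and the cost of each candidate."""
    costs = {
        "full scan": full_scan_cost(table_pages, table_rows),
        "index scan": index_scan_cost(index_pages_read, table_rows, selectivity),
    }
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    for selectivity in (0.001, 0.2):
        choice, costs = cheapest_access_path(10_000, 1_000_000, selectivity)
        print(selectivity, choice, costs)
```

With these made-up numbers, a highly selective predicate favors the index, while a predicate that keeps 20% of the rows makes the full scan cheaper, which is exactly the kind of trade-off the statistics are meant to reveal.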

This approach inaugurates a dynamic and fine-grained management of queries, as the choice of the optimal plan adapts according to the characteristics of the existing data, rather than a fixed configuration. This innovation has durably influenced modern relational systems such as Oracle Database, PostgreSQL, or SQL Server. It offers a critical performance gain for online analytical processing (OLAP) applications, where billions of rows are regularly queried.

The theoretical advances initiated in 1979 have since led to a multitude of plan optimization algorithms, refined by progress in statistics and heuristic calculations. The process now implements complex search techniques in enormous combinatorial spaces, using strategies such as pruning or metaheuristics to handle possible plan explosion when the number of tables involved multiplies.

The central role of statistics in the development of the optimal execution plan

The core of the Cost Based Optimizer undoubtedly rests on the quality of the collected statistics. These descriptive data on tables, indexes, value distributions, and predicate selectivity feed the cost estimation functions. Without a reliable base, the CBO risks making erroneous choices that occasionally work out by luck but can also cause huge performance losses.

Three major types of statistics govern these calculations: cardinality, selectivity, and histograms of value distribution.

  • Cardinality: The total number of rows in a table, or the estimated number of rows produced by an operation such as a join or a filter. It indicates the volume of data to be processed.
  • Selectivity: The proportion of rows retained by a given predicate. For example, the condition WHERE age > 50 may keep 20% or only 5% of rows depending on data distribution.
  • Histograms: These describe the actual distribution of values in columns. They are buckets of frequencies that help anticipate non-uniform distributions; a good CBO relies on this depth to adjust its estimates.
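The three notions above can be tied together in a small sketch. It assumes a simple equi-width histogram with uniform spread inside each bucket; real engines typically use more elaborate structures, and the function names and sample data here are hypothetical.

```python
def build_histogram(values, low, high, buckets):
    """Count how many values fall into each of `buckets` equal-width ranges."""
    width = (high - low) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - low) / width), buckets - 1)
        counts[idx] += 1
    return counts, width


def selectivity_greater_than(counts, low, width, threshold):
    """Estimate the fraction of rows with value > threshold."""
    total = sum(counts)
    matching = 0.0
    for i, count in enumerate(counts):
        bucket_low = low + i * width
        bucket_high = bucket_low + width
        if threshold <= bucket_low:
            matching += count  # whole bucket qualifies
        elif threshold < bucket_high:
            # Assumption: values are spread uniformly inside a bucket.
            matching += count * (bucket_high - threshold) / width
    return matching / total


# Skewed sample: most people are young, few are older than 50.
ages = [20] * 50 + [30] * 30 + [60] * 15 + [70] * 5
counts, width = build_histogram(ages, low=0, high=100, buckets=10)

selectivity = selectivity_greater_than(counts, 0, width, 50)
estimated_rows = selectivity * len(ages)   # cardinality = selectivity x table rows
uniform_guess = (100 - 50) / 100           # what a no-histogram estimate would assume

if __name__ == "__main__":
    print(selectivity, estimated_rows, uniform_guess)
```

On this sample the histogram-based estimate for age > 50 is 20% of rows, whereas assuming a uniform spread between 0 and 100 would have guessed 50%.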

Management systems such as Oracle offer built-in procedures like DBMS_STATS.GATHER_TABLE_STATS to collect and update these statistics. This process is generally scheduled daily to guarantee their freshness. PostgreSQL relies on the autovacuum daemon, which runs ANALYZE automatically once a modification threshold is reached (unless configured otherwise). SQL Server enables the AUTO_UPDATE_STATISTICS database option by default for the same purpose.

These refresh mechanisms are crucial because even slightly stale statistics can skew estimates. For example, outdated figures may lead the CBO to assume that an index is optimal for a join, whereas in reality a sequential scan would be faster. Such an error can multiply execution times by 10 or even 100, depending on volume.
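Staleness is easy to observe on a small scale. The sketch below uses SQLite only because it ships with Python; it is a stand-in, not the Oracle, PostgreSQL, or SQL Server mechanism described above. SQLite stores its optimizer statistics in the sqlite_stat1 table when ANALYZE runs and does not refresh them on its own; the table and index names are hypothetical.

```python
import sqlite3


def index_stats(conn, index_name):
    """Return the raw statistics string SQLite keeps for an index, if any."""
    row = conn.execute(
        "SELECT stat FROM sqlite_stat1 WHERE idx = ?", (index_name,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
conn.executemany(
    "INSERT INTO orders (customer_id) VALUES (?)",
    [(i % 10,) for i in range(1_000)],
)

# First collection: the statistics describe a 1,000-row table.
conn.execute("ANALYZE")
after_first_analyze = index_stats(conn, "idx_orders_customer")

# The table grows tenfold, but the stored statistics do not move.
conn.executemany(
    "INSERT INTO orders (customer_id) VALUES (?)",
    [(i % 10,) for i in range(9_000)],
)
stale = index_stats(conn, "idx_orders_customer")

# Only a new ANALYZE brings them back in line with the data.
conn.execute("ANALYZE")
fresh = index_stats(conn, "idx_orders_customer")

if __name__ == "__main__":
    print(after_first_analyze, stale, fresh)
```

The first number in each statistics string is the row count the optimizer believes the index covers, so the stale value keeps describing a table ten times smaller than the real one until statistics are gathered again.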

To continuously monitor the quality of statistical data, third-party solutions like SolarWinds Database Performance Analyzer or pgStatsTuner have established themselves in professional environments. They alert in case of degradation and provide comprehensive reports enabling DBAs to act quickly, ensuring the relevance of CBO choices day after day.

How statistics granularity impacts algorithm choice

Databases like PostgreSQL allow modifying the default_statistics_target parameter, which controls the granularity of histograms. The higher the granularity, the more precise the information the CBO has to calculate the estimated cost of each step. In return, this increase adds overhead during collection.

For example, in a query involving three tables, the CBO can consider half a dozen join orders (3! = 6), each combined with different methods (nested loop, hash join, merge join) according to selectivity. For complex queries with eight or more tables, the number of possible join orders alone runs into the tens of thousands (8! = 40,320 left-deep orderings), making the quality of statistics even more decisive for effectively pruning the search space.
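The growth of the search space can be checked with a few lines of arithmetic. This sketch counts join orders only, before any choice of join method or access path; the function names are hypothetical. Left-deep plans correspond to the n! permutations of the tables, and bushy plans to n! leaf orderings multiplied by the Catalan number of tree shapes.

```python
from math import factorial


def left_deep_join_orders(n_tables: int) -> int:
    """Number of left-deep join orders: one per permutation of the tables."""
    return factorial(n_tables)


def bushy_join_trees(n_tables: int) -> int:
    """Number of ordered binary join trees: n! * Catalan(n-1) = (2(n-1))! / (n-1)!."""
    return factorial(2 * (n_tables - 1)) // factorial(n_tables - 1)


if __name__ == "__main__":
    for n in (3, 5, 8, 10):
        print(n, left_deep_join_orders(n), bushy_join_trees(n))
```

Three tables give 6 left-deep orders and 12 bushy trees; eight tables already give 40,320 and 17,297,280 respectively, which is why exhaustive enumeration quickly stops being an option.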

Algorithm choice and join strategies: nested loop, hash join, and merge join

A major decision of the Cost Based Optimizer concerns the type of join to apply between several tables involved in an SQL query. Three main algorithms stand out: nested loop join, hash join, and merge join. The optimal choice primarily depends on data volume, the presence of indexes, as well as existing statistics.

The nested loop join is often preferred when the outer table is small and the inner table is indexed. It works like two nested loops, testing each row of the outer table against the matching rows of the inner table. Its simplicity is effective on small volumes, but without a usable index on the inner table its cost grows with the product of the two table sizes.

The hash join relies on a phase of building an in-memory hash table from one of the tables, then a probing phase on the entries of the second table via this structure. This mechanism is particularly efficient on large unindexed tables and when enough memory is available to hold the hash structure, drastically reducing processing time compared to nested loop.

The merge join exploits data ordering. Both inputs must be sorted on the join key, which then allows simply merging their corresponding rows without repeated searches. This method is highly efficient for inputs that are already ordered, for example when read through an index, but otherwise the prior sorting phase adds overhead.
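To make the three algorithms concrete, here is a minimal in-memory sketch of each, operating on lists of dictionaries. It is a teaching illustration under simplifying assumptions (equality join on a single key, everything fits in memory), not how any particular engine implements them; the sample tables are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(outer, inner, key):
    """Compare every outer row with every inner row."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]


def hash_join(build, probe, key):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for b in build:
        table[b[key]].append(b)
    return [(b, p) for p in probe for b in table.get(p[key], [])]


def merge_join(left, right, key):
    """Sort both inputs on the key, then advance two cursors in step."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                result.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return result


customers = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Bo"}]
orders = [
    {"customer_id": 2, "order_id": 10},
    {"customer_id": 1, "order_id": 11},
    {"customer_id": 2, "order_id": 12},
    {"customer_id": 3, "order_id": 13},
]

if __name__ == "__main__":
    for join in (nested_loop_join, hash_join, merge_join):
        print(join.__name__, len(join(customers, orders, "customer_id")))
```

All three functions return the same pairs of rows; what differs is the amount of work each does to find them, which is precisely what the cost model tries to predict.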

The Cost Based Optimizer weighs these alternatives based on its estimated cost model and the availability of indexes. For example, on a high volume where the index is fragmented or partially invalid, hash join may prevail despite its greater memory need. Conversely, on a small table, nested loop is often still the fastest.
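The weighing described here can be sketched with deliberately crude formulas. Everything in this example is an assumption made for illustration: the per-probe cost of an index lookup (log2 of the inner size), the factor of 3 applied when the hash table does not fit in memory, and the n log n sort cost. Real optimizers use far richer models.

```python
from math import log2


def join_costs(outer_rows, inner_rows, inner_indexed, inputs_sorted, memory_rows):
    """Return a toy cost (in abstract row operations) for each join method."""
    costs = {}

    # Nested loop: one index lookup per outer row, or a full inner scan without an index.
    per_probe = log2(max(inner_rows, 2)) if inner_indexed else inner_rows
    costs["nested loop"] = outer_rows * per_probe

    # Hash join: read both inputs once; assumed 3x penalty if the build side spills.
    build_rows = min(outer_rows, inner_rows)
    spill_factor = 1 if build_rows <= memory_rows else 3
    costs["hash join"] = spill_factor * (outer_rows + inner_rows)

    # Merge join: sort both inputs unless they already arrive ordered, then one pass.
    sort_cost = 0 if inputs_sorted else (
        outer_rows * log2(max(outer_rows, 2)) + inner_rows * log2(max(inner_rows, 2))
    )
    costs["merge join"] = sort_cost + outer_rows + inner_rows
    return costs


def choose_join(outer_rows, inner_rows, inner_indexed, inputs_sorted, memory_rows):
    costs = join_costs(outer_rows, inner_rows, inner_indexed, inputs_sorted, memory_rows)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    print(choose_join(10, 1_000_000, True, False, 100_000))
    print(choose_join(1_000_000, 1_000_000, False, False, 2_000_000))
    print(choose_join(1_000_000, 1_000_000, False, True, 1_000))
```

Under these assumptions, a tiny outer table with an indexed inner table selects the nested loop, two large unindexed tables with enough memory select the hash join, and pre-sorted inputs with scarce memory select the merge join.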

Some modern systems go further and support adaptive plans. Oracle's adaptive plans and SQL Server's adaptive joins, for example, can defer the choice between a nested loop join and a hash join until execution, once the actual number of rows flowing into the join is known, maximizing overall performance.

Analysis of execution plans to improve SQL query optimization

Fine understanding of the execution plan generated by the Cost Based Optimizer is essential for all developers and administrators who wish to master SQL query performance in their systems.

An execution plan details the sequence of operations that the database engine performs, including table accesses, reading methods (full scan, index scan), different join types, and data sorting. Each step is associated with an estimated cost calculated from statistics, representing the expected consumption in CPU, memory, or disk access.

Exploring this plan notably allows identifying:

  • Costly scans linked to inefficiently used or absent indexes.
  • Suboptimal join choices, such as a nested loop that rescans a large inner table for every outer row.
  • Sorting and grouping operations that can be reduced or avoided.
  • The impact of complex WHERE clauses on estimated cardinality.

In 2026, a recurring example concerns an e-commerce company analyzing its daily transactions. For an SQL query over several tables, examination of the execution plan revealed that the CBO had massively underestimated the cardinality of a join, leading to an inefficient nested loop. After statistics were refreshed, the CBO chose a more suitable hash join, reducing response time by 85%.

Modern DBMSs provide tools to inspect execution plans. SQL Server Management Studio offers detailed graphical views, Oracle SQL Developer integrates tree-like representations, and PostgreSQL provides the EXPLAIN ANALYZE command, which executes the query and reports actual row counts and timings alongside the estimates.
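Reading a plan can be practiced without installing a server. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3 module, purely as a stand-in: its output format differs from the tools named above, and unlike EXPLAIN ANALYZE it does not execute the query. The table, index, and helper names are hypothetical.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable detail column of each plan step."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index the only available access path is a full table scan.
plan_without_index = plan_details(conn, query)

# Once an index exists on the filtered column, the plan switches to an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
plan_with_index = plan_details(conn, query)

if __name__ == "__main__":
    print(plan_without_index)
    print(plan_with_index)
```

Comparing the two plans before and after a change (a new index, refreshed statistics, a rewritten predicate) is the basic loop of plan analysis, whatever the engine.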

It is also common to use hints or directives in the SQL query to temporarily force the use of a specific plan when the CBO errs. However, this practice should remain exceptional as it limits the dynamic adaptability of the engine and may degrade performance in the medium term.

Current limitations and challenges of the Cost Based Optimizer facing complex queries

Despite major advances, the Cost Based Optimizer faces growing difficulties, especially when queries become very complex, involving multiple tables, aggregations, or sophisticated dimensional schemas. Indeed, each error in the initial cardinality estimation can propagate and amplify through subsequent steps, a phenomenon known as estimation error amplification.
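The amplification is simple arithmetic: if each join's selectivity is misjudged by a constant factor, the errors multiply along the chain. A minimal sketch, assuming the usual estimate "input rows x joined table rows x selectivity" at every step; the figures and function name are hypothetical.

```python
def chain_cardinality(base_rows, steps):
    """Propagate a row estimate through a chain of joins.

    `steps` is a list of (joined_table_rows, join_selectivity) pairs.
    """
    rows = float(base_rows)
    for table_rows, selectivity in steps:
        rows = rows * table_rows * selectivity
    return rows


# Four joins whose true selectivity is 0.001 against 1,000-row tables.
true_steps = [(1_000, 0.001)] * 4
# The optimizer underestimates each selectivity by a factor of 2.
estimated_steps = [(1_000, 0.0005)] * 4

true_rows = chain_cardinality(1_000, true_steps)
estimated_rows = chain_cardinality(1_000, estimated_steps)
error_factor = true_rows / estimated_rows

if __name__ == "__main__":
    print(true_rows, estimated_rows, error_factor)
```

A modest factor-of-2 error per join becomes a factor of 16 after four joins, enough to make a nested loop look attractive where a hash join was needed.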

Star schemas in data warehouses illustrate this problem well: multiple joins between large fact tables and their dimensions induce a cascade of sometimes biased estimates. In some cases, the chosen plan can be suboptimal on 15 to 25% of queries, according to TPC-DS benchmarks published in the last decade.

To address these challenges, several databases have integrated mechanisms called adaptive optimization. For example, Oracle 12c introduced Adaptive Query Optimization, which can correct a plan during execution when the observed row counts diverge from the estimates. PostgreSQL 14 and SQL Server 2022 have also improved their cardinality estimation, notably by modeling column correlation more precisely, reducing error by a factor of three to five in some cases.

However, complex predicates on correlated columns remain a weak point, as automatic statistical collection does not always capture these dependencies. Some machine learning tools are currently exploring hybrid approaches, using execution history to better model these difficult aspects.
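The weakness comes from the independence assumption: when per-column selectivities are simply multiplied, a predicate on two columns that move together is underestimated. A minimal sketch on made-up data, where the city fully determines the country:

```python
# Each city belongs to exactly one country, so the two columns are fully correlated.
cities = [
    ("Paris", "France"), ("Lyon", "France"),
    ("Berlin", "Germany"), ("Munich", "Germany"),
    ("Rome", "Italy"), ("Milan", "Italy"),
    ("Madrid", "Spain"), ("Seville", "Spain"),
    ("Lisbon", "Portugal"), ("Porto", "Portugal"),
]
rows = [pair for pair in cities for _ in range(100)]  # 1,000 rows, 100 per city


def selectivity(predicate):
    """Fraction of rows satisfying the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


sel_city = selectivity(lambda r: r[0] == "Paris")        # 0.1
sel_country = selectivity(lambda r: r[1] == "France")    # 0.2

# Independence assumption: multiply the two selectivities.
estimated_rows = sel_city * sel_country * len(rows)

# Reality: every Paris row is also a France row.
actual_rows = sum(1 for r in rows if r[0] == "Paris" and r[1] == "France")

if __name__ == "__main__":
    print(estimated_rows, actual_rows)
```

The multiplied estimate is about 20 rows while 100 rows actually match, a fivefold underestimate from a single pair of correlated columns.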

The Cost Based Optimizer in cloud and distributed environments: new challenges and adaptations

With the massive rise of cloud computing and distributed architectures, the Cost Based Optimizer evolves to handle even more complex contexts. The challenge is to optimize queries that exploit data dispersed over clusters of multiple nodes, often with columnar storage formats like Parquet or ORC.

The classical cost model must integrate a new factor: the network cost generated by transfers between nodes. Whereas a centralized system mainly weighs CPU, memory, and disk resources, in a distributed environment the CBO must also minimize the amount of data exchanged to avoid latency and network congestion.
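One common instance of this trade-off is the choice between broadcasting a small table to every node and repartitioning (shuffling) both tables on the join key. The sketch below compares the bytes sent over the network under two simplifying assumptions that are mine, not any engine's: a broadcast copies the smaller table to every other node, and a shuffle moves a fraction (nodes - 1) / nodes of both tables.

```python
def broadcast_bytes(small_table_bytes: int, nodes: int) -> float:
    # Assumption: the smaller table is copied to each of the other nodes.
    return small_table_bytes * (nodes - 1)


def shuffle_bytes(left_bytes: int, right_bytes: int, nodes: int) -> float:
    # Assumption: with uniform hashing, a row stays local with probability 1/nodes.
    return (left_bytes + right_bytes) * (nodes - 1) / nodes


def choose_join_distribution(left_bytes: int, right_bytes: int, nodes: int):
    """Pick the strategy that moves the fewest bytes across the network."""
    costs = {
        "broadcast": broadcast_bytes(min(left_bytes, right_bytes), nodes),
        "shuffle": shuffle_bytes(left_bytes, right_bytes, nodes),
    }
    return min(costs, key=costs.get), costs


GB = 10**9
MB = 10**6

if __name__ == "__main__":
    print(choose_join_distribution(100 * GB, 10 * MB, nodes=10))
    print(choose_join_distribution(100 * GB, 100 * GB, nodes=10))
```

With a 10 MB dimension joined to a 100 GB fact table, broadcasting moves far less data; with two 100 GB tables, shuffling wins. Table size estimates therefore matter as much here as row counts do in a centralized engine.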

Projects like Apache Spark led the way as early as 2017 with the introduction of a native CBO enabled via spark.sql.cbo.enabled=true, capable of delivering gains of 2 to 8 times on multi-table joins. Similarly, Presto (whose PrestoSQL fork is now Trino) developed a specific model based on cost annotations in the plan tree, traversed node by node.

In managed services from giants like Google BigQuery, the CBO is proprietary and invisible to the end user, who nonetheless benefits from automatic dynamic optimization. The main challenge lies in the quality of statistics collected on heterogeneous sources, ranging from data lakes to JDBC connectors for traditional databases. The absence of robust statistics sometimes forces engines to fall back on generic heuristics, degrading the final quality of plans.

Data teams must therefore strive to enrich and standardize statistics in these hybrid ecosystems, to guarantee the effectiveness of the Cost Based Optimizer and control execution costs in the cloud, where each consumed resource translates into a financial expense.

Costs, licenses, and functional differences of cost-based optimizers in 2026

The 2026 market presents a rich offer of solutions integrating cost-based optimizers, but advanced features such as adaptive optimization or automatic statistics update often remain locked behind premium license tiers.

The following table illustrates this price and functional segmentation well:

Solution        | Edition            | Adaptive Optimizer     | Automatic Statistics Update | Indicative Price
Oracle Database | Enterprise Edition | Yes (AQO)              | Yes (DBMS_STATS)            | ~€25,000 / processor
Oracle Database | Standard Edition 2 | No                     | Partial                     | ~€5,000 / processor
SQL Server      | Enterprise         | Yes (CE v160)          | Yes (AUTO_UPDATE)           | ~€14,256 / core
SQL Server      | Standard           | Limited                | Yes (AUTO_UPDATE)           | ~€3,945 / core
PostgreSQL      | Open Source        | Partial (v14+)         | Yes (autovacuum)            | Free
Google BigQuery | On-demand          | Yes (proprietary)      | Yes (automatic)             | ~$6 / TB processed
Apache Spark    | Open Source        | Native CBO since v2.2+ | Manual                      | Free (infra extra)
Databricks      | Enterprise (DBU)   | Yes (Photon Engine)    | Yes (Delta Statistics)      | ~$0.75 / DBU

This table highlights how the deployment of an effective Cost Based Optimizer depends not only on algorithms and statistics but also on budgets and business needs of companies. For high-volume environments with strong performance requirements, investment in advanced editions is often largely justified by time and efficiency gains.

Key steps in SQL query optimization with the Cost Based Optimizer

To better understand the complexity of the process, here is a simplified outline illustrating how a Cost Based Optimizer develops an optimal execution plan:

  1. Syntax analysis: The engine translates the SQL query into a tree representation of possible operations.
  2. Rewriting and simplification: Some rules simplify or transform the query to reduce the search space.
  3. Statistics lookup: Retrieval of the statistics already collected on tables and indexes (histograms, cardinalities, and selectivities).
  4. Plan exploration: Generation of a set of alternative execution plans, combining join types, operation orders, and access methods.
  5. Estimated cost: Calculation of the predictive cost of each scenario based on statistics and models.
  6. Plan selection: Choice of the plan presenting the lowest total cost.
  7. Execution: Launch of the query according to the selected plan.
  8. Adaptive optimization (on supported systems): Possible dynamic adjustments if execution reality diverges.
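Steps 4 to 6 of this outline (exploration, costing, selection) can be sketched as a small dynamic program over sets of tables, in the spirit of the Selinger approach. This is a simplified illustration under stated assumptions: only left-deep orders are explored, the cost of a plan is the sum of its intermediate result sizes, a missing selectivity means a cross product, and the table sizes and selectivities are invented.

```python
from itertools import combinations


def best_left_deep_plan(cardinalities, selectivities):
    """Return (cost, result_rows, join_order) for the cheapest left-deep plan.

    cardinalities: {table_name: row_count}
    selectivities: {frozenset({a, b}): join selectivity}; missing pairs count as 1.0.
    """
    tables = sorted(cardinalities)
    # best[subset] = (cost so far, estimated rows, join order) for that set of tables.
    best = {frozenset([t]): (0.0, float(cardinalities[t]), (t,)) for t in tables}

    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            for last in combo:  # try each table as the last one joined
                rest = subset - {last}
                cost, rows, order = best[rest]
                sel = 1.0
                for t in rest:
                    sel *= selectivities.get(frozenset((t, last)), 1.0)
                out_rows = rows * cardinalities[last] * sel
                candidate = (cost + out_rows, out_rows, order + (last,))
                # Keep only the cheapest plan per subset (pruning).
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate

    return best[frozenset(tables)]


cardinalities = {"customers": 1_000, "orders": 100_000, "items": 1_000_000}
selectivities = {
    frozenset(("customers", "orders")): 1 / 1_000,
    frozenset(("orders", "items")): 1 / 100_000,
}

if __name__ == "__main__":
    cost, rows, order = best_left_deep_plan(cardinalities, selectivities)
    print(order, cost, rows)
```

Keeping a single best plan per subset is what prevents the enumeration from exploding: the search reuses the cheapest way to join any given set of tables instead of re-exploring every permutation. In this example the cheapest plan joins customers and orders first and items last, avoiding both the large orders-items intermediate result and the customers-items cross product.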

Each step is essential to obtaining an optimal plan. Some databases like Oracle or SQL Server also account for parallel execution when costing plans, which further complicates the algorithm.

The entirety of this chain of operations explains why SQL performance tuning is a profession in its own right, combining deep knowledge of the DBMS, statistical computing, and field experience.
