When OR Just Won't Do: Using UNION for Higher SQL Performance
Copyright © 2024 PawSQL
Problem Definition
If two query conditions are connected by OR, the database optimizer may not be able to use indexes to execute the query. For example, for the following SQL statement:
select * from customer where c_phone like '139%' or c_name = 'Ray' 
Even if both fields have indexes, they can be rewritten as a UNION query to leverage the indexes and improve query performance:
select * from customer where c_phone like '139%'
union  
select * from customer where c_name = 'Ray'
However, this transformation does not always improve performance. It requires certain conditions and cost-based estimation.
If the database supports
INDEX MERGING, you can also enable theINDEX MERGINGoptimization strategy by adjusting database parameters to improve database performance.
Applicable Conditions
- The OR-connected conditions must be indexable.
- The estimated cost of the rewritten UNIONstatement must be lower than the original SQL.
- If the ORbranch conditions are mutually exclusive, rewrite them asUNION ALLinstead ofUNION.
Case Analysis
Case 1. Conditions cannot utilize indexes, no rewrite optimization
select * from customer 
where c_phone = '1' or c_phone like '%139%' 
Analysis: One condition c_phone like '%139%' cannot use indexes. Even after rewriting, a full table scan is still required. PawSQL does not rewrite.
Case 2. Filter condition selectivity is low enough, no rewrite optimization
select * from customer 
where custkey = 1 and (c_phone = '1' or c_phone like '%139%')
Analysis: custkey is the primary key. custkey = 1 uniquely locates one record, so no need to rewrite.
Case 3. Conditions met, use rewrite optimization
select distinct * from customer 
where c_phone like '139%' or c_name = 'Ray' 
Analysis: Both conditions can use indexes and the selectivity is less than 10%. Can be rewritten. distinct can be removed after rewriting by union.
select * from customer where c_phone = '1'  
union
select * from customer where c_phone like '139%' 
Case 4. OR condition branches are mutually exclusive, rewrite to UNION ALL
select * from customer where custkey = 1 or (custkey = 2 and c_phone like '139%')
Analysis: The two condition branches custkey = 2 and c_phone like '139%' and custkey = 1 are mutually exclusive. Rewrite to UNION ALL:
select * from customer where c_custkey = 2 and c_phone like '139%' 
union all
select * from customer where c_custkey = 1
Case 5. Contains order by and limit, use rewrite optimization
select * from orders o  
where O_ORDERDATE>='2021-01-01' and (O_ORDERPRIORITY = 1 or O_SHIPPRIORITY = 1)
order by O_ORDERDATE desc limit 10
Analysis: Although the selectivity of O_ORDERPRIORITY = 1 and O_SHIPPRIORITY = 1 is high, the total cost is lower by leveraging indexes to avoid sorting. PawSQL rewrites the optimization.
select *
from (
   (select /*QB_2*/ *
    from orders as o
    where o.O_ORDERDATE >= '2021-01-01'
    and o.O_SHIPPRIORITY = 1
    order by o.O_ORDERDATE desc limit 10)  
       union  
  (select /*QB_1*/ *
  from orders as o
  where o.O_ORDERDATE >= '2021-01-01'
    and o.O_ORDERPRIORITY = '1'
  order by o.O_ORDERDATE desc limit 10
  )
) as PawDT_1702555889039  
order by PawDT_1702555889039.O_ORDERDATE desc limit 10
Case 6. Contains grouping and aggregates, use rewrite optimization
select O_ORDERDATE, count(1) from orders o 
where O_ORDERDATE>='2021-01-01' and (O_ORDERPRIORITY = 1 or O_SHIPPRIORITY = 1)
group by O_ORDERDATE 
Analysis: Although the selectivity of O_ORDERPRIORITY = 1 and O_SHIPPRIORITY = 1 is high, the total cost is lower by leveraging indexes to avoid sorting. PawSQL rewrites the optimization.
select PawDT_1702884016144.O_ORDERDATE, count(1) 
from (
    select /*QB_2*/ o.O_ORDERDATE, o.O_ORDERKEY  
    from tpch_pkfk.orders as o
    where o.O_ORDERDATE >= '2021-01-01'
    and o.O_SHIPPRIORITY = 1
       union
select /*QB_1*/ o.O_ORDERDATE, o.O_ORDERKEY  
 from tpch_pkfk.orders as o  
 where o.O_ORDERDATE >= '2021-01-01'
    and o.O_ORDERPRIORITY = '1'
) as PawDT_1702884016144 
group by PawDT_1702884016144.O_ORDERDATE
Performance Validation
Case 5. 900x performance improvement
- Before optimization (execution time 432.322ms)

- After optimization (execution time 0.189ms)

Case 6. 15x performance improvement
- Before optimization (2.816ms)

- After optimization (0.204 ms)

🌐 About PawSQL
PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:
- PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
- PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
Contact Us
Email: service@pawsql.com
Website: https://www.pawsql.com
