Query Algorithms
Queries are useful as they reduce the number of scan operations on the underlying physical file structure.
For each relational operation, there can be several different access paths to the particular records needed. The query execution engine can have many algorithms designed to process particular relational operations and access path combinations.
Some examples of algorithms for both the select and join operations are given below.
(a) Selection Algorithms
The select operation must search through the data files for records meeting the selection criteria.
The following are some examples of simple (one attribute):
(i) S1. Linear Searchβ
Every record from the file is read and compared to the selection criteria.
- Execution cost for searching on a non-key attribute is **br**, where **br** is the number of blocks in the file representing relation _r_
- On a key attribute, the average cost is **br / 2**, with a worst case of **br**
(ii) S2. Binary Search on Primary Keyβ
A binary search, on equality, performed on a primary key attribute (file ordered by the key) has a worst-case cost of:
βlogβ‘2(br)β\lceil \log_2(br) \rceil
βlog2(br)β
This can be significantly more efficient than linear search, particularly for a large number of records.
(iii) S3. Search using a Primary Index on Equalityβ
With a BβΊ-tree index, an equality comparison on a key attribute will have a worst-case cost of:
- Height of the tree (in the index file)
- Plus one to retrieve the record from the data file
An equality comparison on a non-key attribute will be the same except that multiple records may meet the condition. In that case, we add the number of blocks containing the records to the cost.
(iv) S4. Search using a Primary Index on Comparisonβ
When the comparison operators (**<, β€, >, β₯**) are used to retrieve multiple records from a file sorted by the search attribute, the first record satisfying the condition is located and the total blocks before (**<, β€**) or after (**>, β₯**) is added to the cost of locating the first record.
```text
---
# **(b) Join Algorithms**
Like selection, the join operation can be implemented in a variety of ways. In terms of disk accesses, join operations can be very expensive, so implementing and utilizing efficient join algorithms is critical in minimizing a queryβs execution time.
The following are four well-known types of join algorithms:
---
## **(i) J1. Nested-Loop Join**
This algorithm consists of an inner for loop nested within an outer for loop.
To illustrate this algorithm, we use the following notations:
- **r, s** β Relations r and s
- **tr** β Tuple (record) in relation r
- **ts** β Tuple (record) in relation s
- **nr** β Number of records in relation r
- **ns** β Number of records in relation s
- **br** β Number of blocks with records in relation r
- **bs** β Number of blocks with records in relation s
---
```text
for each tuple tr in r
for each tuple ts in s
if join condition is true for (tr, ts)
add tr + ts to the result
Cost Analysisβ
- Each record in outer relation r is scanned once
- Each record in inner relation s is scanned nr times
- Total record scans = nr Γ ns
If only one block of each relation fits into memory:
- Cost = nr Γ bs + br
If all blocks of both relations fit into memory:
- Cost = br + bs
If all blocks of relation s (inner relation) fit into memory:
- Cost = br + bs
Therefore, if one relation can fully fit into memory, it is better to use it as the inner relation.
Noteβ
Even though nested-loop join is expensive in worst cases, it has an advantage:
- It does not impose restrictions on access paths
- Works regardless of join condition
(ii) J2. Index Nested Loop Joinβ
This algorithm is similar to nested-loop join, but:
- Uses an index on the inner relationβs join attribute
- Avoids full scan of the inner relation
Each lookup in the inner loop becomes an equality selection
Let:
- c = cost of lookup
(iii) J3. Sort-Merge Joinβ
This algorithm:
- Works for natural joins and equi-joins
- Requires both relations (r and s) to be sorted on join attributes
Key Pointsβ
- Each record in both relations is scanned only once
- Worst and best-case cost:
br+bsbr + bs
br+bs
- Variants are used when:
- Data is unsorted
- Secondary indices exist
(iv) J4. Hash Joinβ
Like sort-merge join, hash join is used for:
- Natural joins
- Equi-joins
Workingβ
- Uses two hash tables (one per relation)
- Records are partitioned based on hash values of join attributes
- Matching partitions are joined
Stepsβ
- Partition both relations using hash function
- Build hash table for smaller relation
- Probe with records from other relation
- Output matching tuples
Noteβ
- Collisions may occur
- Still efficient due to partitioning
Query Optimization
The function of a DBMS query optimization engine is to:
Find an evaluation plan that reduces total execution cost
Heuristic-Based Optimizationβ
- Uses mathematical rules
- Transforms query into more efficient form
Based Onβ
- Relational algebra expressions
- Query trees
- Cost estimation
- Query semantics
Goalβ
Select:
- Which rules to apply
- When to apply
- How to apply
Typically, optimization works on internal representations such as:
- Query tree
- Generated from high-level language (e.g., SQL)
Expression
After a high-level query (i.e., SQL statement) has been parsed into an equivalent relational algebra expression, the query optimizer can perform heuristic rules on the expression and tree to transform the expression and tree into equivalent but optimized forms.