Considering Introduced Redundancy
(Denormalization)

By definition, a fully normalized data model involves no redundancy. If some redundancy is introduced, so is the possibility of inconsistency, and triggers need to be used to ensure that consistency is maintained when data is inserted, updated or deleted.

Why introduce redundancy?

Given the performance cost of executing such triggers, why would you want to introduce redundant information to a nice logical data model? The answer is, If the cost of determining that information from the related data would be significantly greater than the cost of maintaining it.

Of course, the measure of cost used here must account for the relative frequency of the respective data manipulation and query operations, the timing of these operations, and their response time targets, rather than just the I/O and CPU costs of the respective operations.

The potential benefits of introduced redundancy (or denormalization) are as diverse as query execution plans themselves. Nevertheless, the following examples illustrate much of the scope of this technique.

Avoiding Joins for Lookups

Code tables are used in many applications to validate the codes used in data entry, and to provide full descriptions for queries. You may choose to keep the description in the parent record, instead of the code, in order to avoid a join during queries.

Or consider hierarchical data, such as a hierarchy of corporate entities. If you commonly need to relate entities to their ultimate parent, then storing the key for the ultimate parent in each record will improve performance (and simplify the SQL).

Avoiding Joins for Aggregations

In an order entry system, you may choose to redundantly store the total value of each order in the orders table, to avoid a join to and sum over the line items table.

Or in a financial system, you may choose to redundantly store year-to-date running totals for each general ledger account, in order to avoid a join to and sum over the transactions table.

Avoiding Correlated Subqueries

In time-series data, where each record represents a period, you may choose to store both the start and end dates for the period in each record. This enables simple BETWEEN predicates to be used against the table, instead of resource-hungry correlated subqueries that would need to be executed for each row processed.

The benefits of such redundancy can be impressive. However, given the costs, you should only adopt such strategies if there is a clear performance advantage to be gained.


Copyright Ixora Pty Ltd Send Email Home