DataSphereChronicle

Indexing - Mastering the 'Look Up'

Jan 26, 2024

Indexes are an indespensible tool in acheiving optimization in database retreval. Creating and maintaining indexes is an essential part of any database management role. Indexes are universal across all sorts of different databases - structured or unstructured. Let's explore what an index is, how it works, and best practices for managing them.

Sorting Out your Indexes

What is an index, really? Most know that indexes improve performance, but few understand how indexes impact databases. An index is sorting logic applied to a specific column on a table to optimize the look up. In most cases an index aligns to the ‘WHERE’ part of a query. Essentially, an index is a sorted version of a column in question, so we can search that column in a faster way to achieve faster performance. Take the following example - lets say we were looking up rows where ‘CITY = Zebulon’ - in the list we have the query would search through the column in ascending order - so a total of 7 lookups. This is not optimized since it has to check every row, and if your record happens to be the last entry, the search will have to look through each row before successfully returning.

Now lets say we knew that that CITY list was in alphabetical order. And since we know z is at the end of the alphabet we could do a much more efficient sort. Indexes default to a Binary Tree - or B-tree - sorting method (more about alternative sorting methods later in this pattern). With this method, we can search the list much more efficiently, finding Zebulon after only 4 lookups. You can see how such a pattern would drastically improve searching across millions or billions of rows.

All an index really contains is a mapping of a columns actual order in a database to a sorted list. When there is a filter on that column, it can search the sorted list stated in the index more efficiently than searching the table itself. Indexes do all this as well without duplicating data or creating another physical table.

In a database there are two kinds of indexes: clustered and non-clustered

Clustered indexes are just those that are created by default when a table with a primary key is created. To create this index there is nothing you need to do beyond creating the table with a declared primary key.

Un-clustered indexes are created as needed by database administrators or engineers to optimize a heavily used query for best read performance. These indexes are added to the table configuration of the column being optimized. Un-clustered index can refer to one or multiple columns with any sort method. Within non-clustered indexes there is a subcategory on multi-column indexes.

Behind the Curtain

Why don’t we make indexes on every column that is searched? Indexes or too many of them of course have drawbacks. Consider the following before building one:

Impact on Writes - There is no free lunch - and the consequence of indexes is that it can slow down your write operations. When you add a new row to the database, any fields that belong to an index will need to be added and sorted according to that index. Every time a write is made to the database, the indexes are unusable until they have updated. Therefore, if the database is constantly receiving writes, then the indexes will never be usable and cause congestion in your database.

Increased Maintenance - Indexes are another database object - like a view - that has to be maintained in the schema. When you migrate databases, you must make sure to copy over all indexes and continuously apply updates as needed. Indexes should be monitored that they are first still being used and secondly that they are providing a performance improvement.

Challenge of Multi-column Indexes - While multi-column indexes are very powerful, they are difficult to build, maintain, and sometimes can impact the write performance. Single column indexes are more flexible, where as multi-column indexes will result in poor performance when they do not work.

Adding Value to Everyday Operations

The follow are the best situations to add indexes to your database operations where drawback will be minimized and the most value will be gained.

Primary Keys - With clustered indexes, we should always have a unique index for our primary key. This will cover read optimizations for most basic look ups or counts on the table. This is one of the major reasons to declare a primary key, as the index is simply created by default along with the table with no additional work.

High Volume Queries - When you have a large percent of reads from a table covered by the same query, then you have an opportunity to leverage an index. For example, if your table’s reads are accounted for by 30% from one particular query (maybe the query that loads the landing dashboard in an application) this would lend to a good place to add an index. As your database becomes larger and larger, the more likely you are to see benefits from indexing.

Foreign Keys - Along similar lines for query performance, if you are doing consistent joins on a table, building an index on the defined foreign keys could improve your performance. If you have foreign keys in your tables that is frequently filtered or joined on with other tables, consider adding an index.

Read Heavy Databases - Overall, indexing should be a large operation for any database that is read heavy or where read performance is a priority. If your database is in a situation where over 50% of operations are reads, then indexing should be constantly re-evaluated and part of regular data operations. By the same logic, if your database is very write heavy, then indexing should be less of a priority and may be more costly to maintain than it’s worth.

In conclusion, in the ecosystem of database objects, indexes play an important role to keeping the database running efficiently. By understanding the fundamentals, advantages, drawbacks, and best practices, you can harness the full potential of indexing to elelvate the performance of your database.

By Ellie Najewicz