Database - Data Life Expectancy
The data life expectancy is the amount of time that data is expected to remain in the database. You need to establish the requirement for how long fact data needs to be held online at the fact level. Older fact data can often be held online in aggregated form, reducing the amount of space it requires. This is a useful trick, but it has a cost. Rolling up the data requires processing power, and can require more space in the short term. This needs to be designed into the warehouse manager’s processing schedule.
For archiving there are three figures that need to be established. First, you need to establish the start date; that is, when archiving is required to begin. For example, if the aim to keep 3 years worth of data online, and the data warehouse is to start off with 18 months of historic data loaded, you will need to begin archiving 18 months after the go-live date. If designed properly, the load process will not depend on the archiving process being run first. Sufficient space should be made available to allow the archiving to occur later than immediately required. This allows the archiving to be performed at a more convenient time.
Data Archiving in a Data Warehouse
As data ages, it will need to be purged to clear room for more current data. There is often a desire to hold data online for long periods of time, but usually the practicalities of working with such huge volumes of data dictate a reasonable cutoff point. Most companies have a natural cycle of data that will indicate how long data remains immediately useful. These times will often conflict with business rules or legal requirements for how long data should actually be maintained. This generally means that data will need to be archived off the data warehouse, either to nearline storage, or to tape. Which method should be used will depend on how likely old data is to be retrieved. It is worth mentioning here that data warehouses rarely generate new fact data; they normally load fact data from other sources. This means that the source system will already have the data archived. This is often a sufficient argument to avoid archiving on the data warehouse.
If archiving is required it needs to be factored into the capacity planning and the overnight window design from the beginning. This is important even though archiving may not be required for some years. If it is not designed in from the beginning it may become a serious problem when it is required to run for the first time. It will probably increase the overheads of the overnight process and may even delay the start of the process by having to clean off space to allow new data to be loaded.
Performance Monitoring And Tuning
The warehouse manager is responsible for monitoring the performance of any operation it runs. It should also manage the storage of the performance statistics produced by the query manager. These statistics, along with the query history of all queries, should be stored in the database, so that they can be analyzed.
These statistics need to be kept for as long as possible. For each query the statistics could be averaged over time to reduce storage space. However, it is important to keep a time trail for these statistics, so that changing trends can be identified. Therefore, the data should not be aggregated over too large a time period, or it may mask trends. Older data can be aggregated over a greater time period if the averages do not change, but recent data should be averaged over a day or at most a week.
The warehouse manager is responsible for the creation of aggregations and indexes: therefore, any tuning that requires either an aggregation or an index will need to be passed on to the warehouse manager. Ideally the warehouse manager itself should be capable of identifying the need for new aggregations, and automatically creating them. There is no such predictive software on the market today, but software is beginning to appear that can perform that function at a basic level, and it is inevitable that more sophisticated software will follow.
Metadata Management
The warehouse manager is responsible for managing the metadata. Metadata is used to describe the data that resides within the data warehouse; it also describes where each piece of data resides. If new data is loaded, old data is archived or current data is moved within the data warehouse, meta data needs to be generated or-updated to keep track of where that data resides. For example, if the fact data is partitioned into multiple tables for management reasons, the warehouse manager will need to be aware of the partitioning key, and the range of key values that each partition contains.
This metadata will, be required by the query manager to allow it to generate queries. The query manager will also need to be able to identify when different tables contain the same data, but possibly as different levels of aggregation. The warehouse manager needs to generate this information when it creates or updates aggregations.
Query Scheduling and Monitoring
The query manager has several distinct responsibilities which are used to control various functions such as user access to the data, query scheduling and query monitoring. These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one. of the most bespoke pieces of software in the data warehouse. The tools required to perform the query manager’s tasks are not yet available, but are beginning to appear. They will become increasingly important as data warehouse user populations grow. Indeed, if the future dream of large user communities running against enormous data warehouses is to be realized, these tools are both essential and inevitable.
The query manager’s task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user’s requirements, and in the background using the metadata it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is probable that not all tools will work directly via the query manager. Any restrictions on the users’ access will need to be imposed by the tool itself.
The Best Warehouse Manager Software
There is no warehouse manager software currently on the market that can .do this automatically, which means that human input is required. One suggestion to work around this problem is to allow a program or procedure to be associated with each aggregation. If one exists, the warehouse manager can use it instead of directly generating the aggregation from a single query. To make the best use of system resources, the warehouse manager may need to use parallelism for any given operation. It may also need to run multiple operations side by side. To ensure that the system is neither underutilized nor swamped with work, this process can be driven via the queuing mechanisms of the schedule manager.
To achieve this the aggregations may need to be prioritized. They may also need to be sequenced if one aggregation is to be built from another. This means that the warehouse manager needs to integrate with the queuing mechanism of the schedule manager. If data marts are being used, the warehouse manager will be responsible for maintaining them. It will schedule any refreshes of the data marts, and may also be responsible for pushing the data out to the different data mart.
Overnight Database Processing
The first point to note about the nightly processing is that the CPU bandwidth required for the data transformation will depend on the amount of data processing that is involved. Unless there is an enormous amount of data transformation, it is unlikely that this operation will require more CPU bandwidth than the aggregation and index creation operations.
The same applies to the backup, although you should bear in mind that backing up large quantities of data in a short period of time will cause a major kickback onto the CPU. Nonetheless, it is likely that your backup will fit into the CPU bandwidth required by the aggregation and index creation. The CPU load imposed by backup is generally due to the use of many parallel streams to speed up the backup. If the backup is spread over more hours, the amount of parallelism will come down, and its CPU bandwidth requirement will drop.
The data load is another task that can use massive parallelism to speed up its operation. As with backup, if you use fewer parallel streams it will use less CPU bandwidth and will take longer to run. Again, it is likely to fit into whatever CPU capacity is required for the aggregation and index creation. In some cases there will be post processing to be carried out on the loaded data. In some data warehouses this processing can be extremely heavy, and may in fact outweigh the aggregation processing. If that is the case, use the post processing as the baseline for the CPU required rather than the aggregation processing.