Data Load in a Data Warehouse

Filed under: General — admin at 5:18 pm on Monday, August 18, 2008

The ultimate goal and the responsibility of the load manager is to get data loaded into the data warehouse database. There are a number of different methods for getting the data from the source into the data warehouse. Some of the more commonly used approaches are: loading from flat files, 3GL programs for extracting and loading data, gateway products for connecting different RDBMSs, and copy management tools.

Some of these tools, such as the gateway products and some of the third-party tools, are designed to select data directly from the source system itself. Others rely on the data being transferred from the source system as a file. If large amounts of data need to be loaded, then it is likely that you will need to use the RDBMS’s own loader software. These loader programs will sometimes have direct path load mechanisms, which bypass the SQL layer of the RDBMS and build whole database blocks directly. This will greatly improve the performance of the load.

Generally, the load performance can be further improved if there is.a no-logging option. This option switches off journaling for the load operation. Note, however, that if this option is used it needs to be taken into consideration in your backup recovery strategy. If no-logging is to be used you also need to design the load jobs to be re-startable.

Data Movement in a Data Warehouse

Filed under: General — admin at 5:22 pm on Wednesday, August 13, 2008

The warehouse manager is responsible for any data movement within the warehouse, such as aggregation creation and maintenance. Any tables, indexes or other objects required will also be created and maintained by the warehouse manager.
Aggregation creation and maintenance is a process that should be automated.

The warehouse manager needs an interface that allows new aggregations to be requested, and old aggregations to be removed. This mechanism can be controlled via tables in the database. Database procedures can be used to insert a row into a table for each new aggregation to be created. The warehouse manager will also need to know what fields are required in the aggregation and what fields to aggregate on. From this information it can build the aggregation and the metadata to describe it.

Most aggregations can be created by a single query. However, the query is often complex and not necessarily the most efficient method of generating the aggregation. It may be possible to generate the aggregation from other aggregations, or by a multiple-stage process with interim result tables. These interim results can often be used as the basis of multiple aggregations, thereby saving time and resource overall. The warehouse manager needs to be capable of taking advantage of these optimizations.

Data Warehouse Manager

Filed under: General — admin at 5:18 pm on Monday, August 4, 2008

The warehouse manager is responsible for maintaining the data while it is in the data warehouse. This layer of software also creates and maintains the metadata that describes where everything resides within the data warehouse. The responsibilities of the warehouse manager are data movement, metadata management, performance monitoring and tuning and data archiving.

Given the size and complexity of most data warehouses it is important to automate as many data management operations as possible. The warehouse manager should be designed to deal with all the tasks mentioned above automatically. Extensive reporting capabilities are a must for the warehouse manager; in particular, it should be able to report any failures of data movement, such as aggregation creation failures. These facts need to be taken into account by the data warehouse manager and proper attention given to the main tasks so that aggregation creation failures are minimized to the best possible extent.

Data Transformation

Filed under: General — admin at 5:11 pm on Saturday, July 26, 2008

An amount of data transformation will almost certainly need to be performed. For example, the source may supply more fields than are required; these fields will need to be stripped during the load. The source fields may be in formats different from that of the RDBMS you are using, requiring the format to be changed. There is often a requirement for a single field to be broken into two, or for multiple fields to be compacted into one. Extra fields, such as a sequence number to be used as a unique identifier, may need to be added. Other fields may need to be derived from some combination of the supplied data and data already in the data warehouse. If more complex data mapping or transformation is required, you will need programs to carry out the transformations. You may need to look at using a copy management or transformation tool.

If the data is being transferred directly via a gateway product into the database rather than by file, the load manager can use the gateway product itself to check how many records are to be transferred. It can then check at the data warehouse end that -the correct number of records have been transferred. However it is done, these checks must be implemented to ensure that all the correct data makes it from source to data warehouse. You also need to ensure that no data gets loaded twice.

Load Manager in a Warehouse

Filed under: General — admin at 5:08 pm on Saturday, July 19, 2008

The load manager is responsible for any data transformation required and for the loading of data into the database. These responsibilities can be summarized as data source interaction, data transformation and data load. The depth and complexity of the requirements placed on the load manager by each of these areas vary radically from data warehouse to data warehouse. This means that the load manager can vary from the simplicity of a few simple load scripts to an enormously involved development involving data transformation tools, database to database gateways and complex load programs.

It is not uncommon for the load manager to start out simple and to grow in complexity as data sources and user requirements are added. It is also possible for the development to go the other way, as data sources merge or are replaced. In particular, if the organization has multiple data warehouses, one can often become a greatly simplified source of data for another.

If a copy management tool is being used to transfer the data from the source system, this task becomes the responsibility of the copy management tool. In this case you need to ensure that the tool is carrying out the relevant checks. For example, you need to ensure that the tool verifies that the whole of a day’s data gets transferred, and not just part of it.

Data Warehouse Process Managers

Filed under: General — admin at 5:06 pm on Friday, July 4, 2008

The data warehouse process managers are pieces of software responsible for the flow, maintenance and upkeep of the data, both into and out of the data warehouse database. There are three different data warehouse process managers - load manager, warehouse manager and the query manager. Each process manager has its own responsibilities and complexities, and there are potential overlaps between the responsibilities of each process. It will help to avoid any ambiguities if we draw some strict boundaries within which each process will work.

The other useful feature of having a database-aware backup recovery manager is that the different parts of the database can be addressed by their database names, and not by the often abbreviated and unreadable file names. One last point worth mentioning; ideally, the backup software used should work across clusters or MPP platforms if that is the environment being used. The backup recovery manager should also be capable of this.

Manging Database Backups

Filed under: General — admin at 5:04 pm on Sunday, June 22, 2008

The backup software will keep some form of database of where and when each piece of data was backed up. The backup recovery manager must have a good usable front end to that database. If it does not, it will be a nightmare trying to find anything to perform a recovery. It is not advisable to get into a situation where you have to trawl through hundreds of tapes trying to find various bits of a file. This would be a very uncomfortable position, particularly if users and management were breathing down your neck.

It is useful for the backup software and the backup recovery manager to be database aware. Being database aware means that the software can be addressed in database terms, and will not perform backups that would not be viable. For example, in an Oracle database objects reside in tablespaces. It is no use backing up a single tablespace on its own unless you have roll-forward journal archiving enabled. If roll-forward journal archiving is not enabled, restoring tablespaces from different times will mean that they are out of sync and could corrupt your database. If the tool is database aware it can prevent mistakes like. this happening. The exception to this rule is a read-only tablespace, which can be backed up and restored separately from the rest of the database. Again, a database-aware tool will know this.

Database Backup Recovery Manager

Filed under: General — admin at 5:02 pm on Saturday, June 14, 2008

As most backup packages have their own tools for managing backup and recovery, the requirement for a backup recovery manager will probably be met by the backup software itself. Nonetheless, it is worth outlining here the abilities that a tool should have.

There are three main features required for the management of backups:
scheduling
backup data tracking
database awareness

It must be easy for operations and management staff to control backup schedules via the tool. The backup recovery manager will need to integrate with whatever schedule manager software is being used. You should check how easy it is to schedule a one off backup of part of the data warehouse. At the end of the day, backups are taken for one reason and one reason only: protection against data loss. Backups are useless if they cannot be used when they are required. There are a number of questions that you need to be able to answer with confidence.

For any given database data file or any file external to the database, do you:
know when it was last backed up?
know where the tape, or other medium, it was backed up to is located?
know where any related files, such as journal files, are located?

Database Diskspace Required for a Data Warehouse

Filed under: General — admin at 3:01 pm on Thursday, May 29, 2008

The ad hoc demands placed on the data warehouse will often require the use of temporary tables to store interim results. Space will need to be set aside for this purpose. That space needs to be efficiently managed. The consequence of not managing it efficiently is that either the space required will be many times larger than necessary, or everything will grind to a halt. lf the system and database management tool allows this space to be efficiently monitored, and allocated space to be dated and tracked, the space can be kept to a minimum, saving many tens of gigabytes of disk space.

As ever, it is the size and complexity of the data warehouse that is the issue here. Large numbers of error log and trace files will be produced. These need to be regularly and automatically scanned for errors or problems. This is one of the functions of the event manager. These files also need to be managed, with old log and trace files being purged to save space and make it easier to deal with the current logs.

Logs that contain significant data should be archived. Logs such as the database log, which contains all major changes, should be kept to maintain a history. This could prove to be important in a crisis situation, where the thread of events over the last number of weeks or months can become important. Any log files that contain errors that have not yet been explained should also be kept until the cause of the error is understood. The tool should allow such indicated log files to be archived, while others are deleted.

How Much Memory does a Database require?

Filed under: General — admin at 11:58 pm on Monday, May 19, 2008

Memory is a commodity that you can never have enough of. What you need to estimate is the minimum requirement. There are a number of things that affect the amount of memory required. First, there are the database requirements. The database will need memory to cache data blocks as they are used; it will also need memory to cache parsed SQI statements and so on. These requirements will vary from RDBMS to RDBMS, an, you will need to work them out for whatever software you are using.

You will also need memory for sort space. Each process that performs a sort will require an amount of sort area. This is usually parameterized, and can be specified by you. Remember that when you use parallel technology you may need to allow enough sort area for each sorting process. This can add up, particularly if you have multiple concurrent queries running.

Secondly, each user connected to the system will use an amount of memory how much will depend on how they are connected to the system and what software they are running. As it is likely that the users will be connected in a client-server mode, the user memory requirement may be quite small. Finally, the operating system will require an amount of memory. This will vary with the operating system and the features and tools you are running. You can get the hardware vendor to estimate how much memory the system will use.