Monday, January 23, 2006

Got more Locks than Chubb!

Currently waist deep in locking entrails, having revisited the locking implementation code for the fourth time!

I have now developed a generic locking class which uses a state machine to control the behaviour of the locking object. Each concrete specialisation provides the nested state classes used by the generic lock class. Currently the generic lock has concrete specialisations for the page locking implementation and for schema locking.
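
In rough C# terms the shape is something like this - a sketch only, and all the names here are illustrative rather than the project's actual identifiers:

```csharp
// Sketch: a generic lock driven by nested state objects. Each concrete
// specialisation (page lock, schema lock) supplies its own State subclasses.
public abstract class GenericLock<TMode> where TMode : struct
{
    protected abstract class State
    {
        // Return the next state if the request can be granted,
        // or null if the caller must wait and retry.
        public abstract State OnAcquire(TMode requested);
        public abstract State OnRelease(TMode released);
    }

    private State _current;
    private readonly object _sync = new object();

    protected GenericLock(State initial) { _current = initial; }

    public bool TryAcquire(TMode requested)
    {
        lock (_sync)
        {
            State next = _current.OnAcquire(requested);
            if (next == null) return false;   // incompatible - caller must wait
            _current = next;
            return true;
        }
    }

    public void Release(TMode released)
    {
        lock (_sync) { _current = _current.OnRelease(released); }
    }
}
```

The nice property of this arrangement is that each specialisation only has to describe its legal transitions - the generic class never needs to know what the modes mean.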

Schema locks (in case you were wondering) are used to control updates to the table schema or sample definition block. This ensures that query builders can lock the schema (read access) and build their query before getting a suitable lock on the table and releasing the schema lock. To modify the schema, a connection will have to wait until all readers have released and all bulk updates have been completed. At that point the schema exclusive lock is granted and an exclusive table lock can be acquired to facilitate the update of the table rows (or, in the case of a sample, the update of the sample data).
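
As a hedged sketch of that sequence - the connection interface and every method name below are invented purely for illustration:

```csharp
public enum SchemaLockMode { Read, Exclusive }
public enum TableLockMode { Shared, BulkUpdate, Exclusive }

public interface IConnection
{
    void AcquireSchemaLock(string table, SchemaLockMode mode);
    void ReleaseSchemaLock(string table);
    void AcquireTableLock(string table, TableLockMode mode);
}

public static class QueryProtocol
{
    // Pin the schema for read, build the query against a stable schema,
    // take the table lock, then let go of the schema lock.
    public static void PrepareQuery(IConnection conn, string table)
    {
        conn.AcquireSchemaLock(table, SchemaLockMode.Read);
        try
        {
            // ... build the query here - the schema cannot change under us ...
            conn.AcquireTableLock(table, TableLockMode.Shared);
        }
        finally
        {
            conn.ReleaseSchemaLock(table);
        }
    }
}
```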

Buoyed by this coding advance, I started simplifying the logic inside the Lock Manager class by creating a generic lock handler class. This class knows how to maintain both a hashtable lookup for active locks and a free lock queue which is populated as active locks are released. So far so good, and it all works; however, I have been unable to integrate the RLock (aka Resource Lock) into this scheme. I will probably end up writing a concrete version of the lock handler specifically for RLocks - if only to ensure that the free lock queue thread (yet to be written, but it keeps the free lock queue populated and ready for action) can be implemented in such a way that all these handlers expose a particular interface - say ILockHandler, for example.
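
Something along these lines is what I'm picturing - ILockHandler is the name floated above, but its members and the generic handler below are guesses for illustration:

```csharp
using System.Collections.Generic;

public interface ILockHandler
{
    int FreeLockCount { get; }
    void PopulateFreeLockQueue(int target);   // run by the free lock queue thread
}

public class LockHandler<TKey, TLock> : ILockHandler
    where TLock : class, new()
{
    private readonly Dictionary<TKey, TLock> _active = new Dictionary<TKey, TLock>();
    private readonly Queue<TLock> _free = new Queue<TLock>();

    public int FreeLockCount
    {
        get { lock (_free) { return _free.Count; } }
    }

    // Keep a ready supply of lock objects queued up for action.
    public void PopulateFreeLockQueue(int target)
    {
        lock (_free)
        {
            while (_free.Count < target) _free.Enqueue(new TLock());
        }
    }

    // Look up (or create) the active lock for a resource.
    public TLock GetLock(TKey key)
    {
        lock (_active)
        {
            TLock item;
            if (!_active.TryGetValue(key, out item))
            {
                lock (_free) { item = _free.Count > 0 ? _free.Dequeue() : new TLock(); }
                _active[key] = item;
            }
            return item;
        }
    }

    // Release recycles the lock object back onto the free queue.
    public void ReleaseLock(TKey key)
    {
        lock (_active)
        {
            TLock item;
            if (_active.TryGetValue(key, out item))
            {
                _active.Remove(key);
                lock (_free) { _free.Enqueue(item); }
            }
        }
    }
}
```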

When that is complete there'll be no time for larking about - I need to write the Lock Owner block code, which is used to determine when (and how) lock-escalation occurs.

I feel tired already!

Monday, January 09, 2006

Locking Revisited

Isn't it strange how, when you think you have something sussed, it turns around and bites you in the most unexpected manner... This project is no stranger to the world of surprises, and this time it was the turn of the current page lock implementation...


Lock that Devil Down

The more I looked at the code I'd written, the more it seemed as if I'd completely buggered up the implementation of intent-locks. On closer inspection it was worse than that, and it highlighted my early misunderstandings regarding the transaction locking classes. I was trying to treat the different lock levels (hierarchically speaking) as almost independent - except that they are not independent! I also tried to create a transaction-based reader/updater/writer lock - this works as far as it goes, but the attempt to use two of these locks to achieve a hierarchical page locking mechanism was optimistic!


Better the Devil...

The revised lock object (which for the moment I've called PageLockEx) now handles all locking within the class itself... It is also a finite state machine, which is nice...
The locking modes supported are:


  • Shared [S]

  • Update [U]

  • Exclusive [X]

  • Intent Shared [IS]

  • Intent Exclusive [IX]

  • Shared with Intent Exclusive [SIX]


The intent locks are used to establish a lock hierarchy chain and speed the operation of locks, as high-level locks do not need to enumerate descendant locks before deciding whether to grant a requested lock - nice! Makes the implementation a right dog's breakfast though - not so nice!
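
For reference, the usual compatibility matrix for these six modes looks like the sketch below - this mirrors the standard SQL Server table, so the project's actual rules may well differ:

```csharp
public enum LockMode { S, U, X, IS, IX, SIX }

public static class LockCompatibility
{
    // [held, requested] => true when the request can be granted alongside
    // the held lock. Rows/columns follow the LockMode enum order.
    private static readonly bool[,] Matrix =
    {
        //            S      U      X      IS     IX     SIX
        /* S   */ { true,  true,  false, true,  false, false },
        /* U   */ { true,  false, false, true,  false, false },
        /* X   */ { false, false, false, false, false, false },
        /* IS  */ { true,  true,  false, true,  true,  true  },
        /* IX  */ { false, false, false, true,  true,  false },
        /* SIX */ { false, false, false, true,  false, false },
    };

    public static bool IsCompatible(LockMode held, LockMode requested)
    {
        return Matrix[(int)held, (int)requested];
    }
}
```

Note how permissive the intent rows are against other intent requests - that is precisely what lets a high-level lock grant or refuse without enumerating its descendants.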

The first draft did make me realise one thing though - lock escalation isn't blind, and neither is lock parent propagation. So I'll get that bit right this time round (well, I'd better - I've never had to sack myself before and don't fancy starting anytime soon...) The other thing I missed was an implementation of three more locks (Schema Update locks, Schema Stability locks and Bulk Update locks) and these will be catered for using bolt-on classes. These classes are only ever bolted onto locking primitives which relate to table objects.

The only other question left to ask is: how on earth is this page locking object going to be tested? Back to the mumbles and half murmurs until I've thought some more about test-plans and so on...!!

Once I've got this hammered into the shape I wanted it in the first place, I'll have to revisit the allocation logic, which I was hoping would be a nice simple implementation - however, since I've gone and created super-flexible devices, it seems like nothing but the complex solution will do! Anyway - that can wait for another day!

Thursday, January 05, 2006

Database Allocation

Work on finishing the database allocation process is getting closer to completion.

Extents - An extent is a contiguous block of eight pages.

Mixed Extents - These are extents which contain pages owned by multiple objects.

Uniform Extents - These are extents which contain pages owned by a single object.
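
A small sketch of the bookkeeping those definitions imply - only the eight-page size and the mixed/uniform split come from the notes above; the field names are assumptions:

```csharp
public enum ExtentType { Mixed, Uniform }

public struct Extent
{
    public const int PagesPerExtent = 8;   // contiguous block of eight pages

    public int FirstPhysicalPageId;        // start of the 8-page run
    public ExtentType Type;
    public int[] PageOwnerObjectIds;       // one owner per page when mixed;
                                           // all the same object when uniform
}
```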

This part of the logic deals with updating distribution pages and providing an extent management solution. Currently the extents are managed within the distribution pages themselves however this may need to be moved to ease the task of determining when objects switch from mixed-extents to uniform extents.

Once the extent management is finalised, locking can be added to ensure only a single transaction can modify a given extent at any one time.

To modify an extent you will need an exclusive lock on the extent and a shared intent lock on the distribution page in which that extent lies.
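
A hedged sketch of that two-lock rule - the lock manager interface and its members are invented for illustration:

```csharp
public interface ILockManager
{
    void AcquireExclusive(int extentId);
    void ReleaseExclusive(int extentId);
    void AcquireIntentShared(int distributionPageId);
    void ReleaseIntentShared(int distributionPageId);
}

public static class ExtentOps
{
    public static void ModifyExtent(ILockManager lm, int extentId, int distributionPageId)
    {
        lm.AcquireIntentShared(distributionPageId);   // shared intent on the parent page
        try
        {
            lm.AcquireExclusive(extentId);            // exclusive on the extent itself
            try
            {
                // ... update the extent's allocation state here ...
            }
            finally { lm.ReleaseExclusive(extentId); }
        }
        finally { lm.ReleaseIntentShared(distributionPageId); }
    }
}
```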

When a distribution page becomes full, a new distribution page must be used or allocated (513 pages from the last) and the process repeated.
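
Assuming "513 pages from the last" means a distribution page followed by the 512 pages it tracks, finding the covering distribution page is simple arithmetic - this is my reading of the layout, not confirmed code:

```csharp
public static class DistributionMap
{
    public const int Interval = 513;   // 1 distribution page + 512 tracked pages

    // Physical page id of the distribution page covering the given page.
    public static int DistributionPageFor(int physicalPageId)
    {
        return (physicalPageId / Interval) * Interval;
    }
}
```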

Thankfully the device objects already know how to change size, and container devices know how to pick the most suitable device for expansion, so once a device has been selected for expansion (and indeed expanded) the final work left to do is in preparing the pages. When shrinking a device the same operations need to be done, but in reverse order...

It remains to be seen whether it is a good idea to checkpoint after device allocations, but the system will certainly checkpoint after adding or removing devices!

Tuesday, January 03, 2006

Page Buffers and Idents

Pages are the logical view of the data; however, to abstract the page implementation from the persistence logic, the concept of Page Buffers was created.

Page Buffers are the device-level view of pages. These objects track where they point, in terms of both a virtual ID and a logical ID.

Virtual Page ID - Contains the unique device ID (unique to a given database that is) and the physical page ID (which ranges from 0 to the number of pages allocated on the device).

Logical Page ID - Provides an abstraction for the virtual page ID to ease the task of moving pages.
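
In rough terms the two identifiers might look like this - the field widths are assumptions, since only the components are described above:

```csharp
public struct VirtualPageId
{
    public ushort DeviceId;       // unique within a given database
    public uint PhysicalPageId;   // 0 .. pages allocated on the device
}

public struct LogicalPageId
{
    public long Value;            // stable handle, remapped if the page moves
}
```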

The page cache (which, incidentally, is implemented as a device class) also contains the implementation of the Free Buffer manager (which ensures we always have a ready supply of buffer objects when we need them), the Read Manager (which will be expanded to incorporate the Read Ahead manager), and the Lazy Writer. The Lazy Writer also deals with check-pointing - under these conditions extra threads are spawned to write all dirty cache pages to disk as quickly as possible.

Q. So What Is an Audio Database?

A. Simply put, the Audio Database is essentially built on top of a "page" server.

Q. Well that's simple... but what exactly is a Page Server?

A. Here's where it gets interesting...

Page Server


A page is simply a block of memory - currently fixed in size at 8192 bytes (8 KB).
The idea is that each database file is divided into pages and the page server maintains a memory cache of loaded/modified pages.
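
The fixed page size means a page's file offset falls out of simple arithmetic (a sketch - the real on-disk layout may add headers):

```csharp
public static class PageMath
{
    public const int PageSize = 8192;   // 8 KB

    // Byte offset of a page within its database file.
    public static long FileOffset(int physicalPageId)
    {
        return (long)physicalPageId * PageSize;
    }
}
```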

To facilitate recovery, if a page is changed then the changes must be logged to the transaction log before the page itself is written to disk (lazy write).

Once the page change has been logged we are free to save the actual data page at any time we choose (since the change itself is logged and can be recovered).
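
A minimal write-ahead-logging sketch of that rule - the use of LSNs and all the names here are assumptions:

```csharp
using System;

public class PageBuffer
{
    public long PageLsn;    // LSN of the last logged change to this page
    public bool IsDirty;
    public byte[] Data = new byte[8192];
}

public class LogManager
{
    public long FlushedLsn { get; private set; }

    public void FlushUpTo(long lsn)
    {
        // ... write log records up to 'lsn' to stable storage ...
        FlushedLsn = Math.Max(FlushedLsn, lsn);
    }
}

public class LazyWriter
{
    private readonly LogManager _log;
    public LazyWriter(LogManager log) { _log = log; }

    public void WritePage(PageBuffer page)
    {
        // WAL rule: never write a page ahead of its log records.
        if (page.PageLsn > _log.FlushedLsn)
            _log.FlushUpTo(page.PageLsn);

        // ... write page.Data to the owning device ...
        page.IsDirty = false;
    }
}
```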

Forced Saves
To ensure consistency the page server will occasionally perform a forced save of all cached pages - this is known as a checkpoint operation. The checkpoint is very important as it defines the last known recovery point when the page server is restarted.

Still with me? Good!

Recovery
Clearly when a database is brought online there are some extra recovery steps which must be performed before the data store is synchronised with the logged transactions.

The process is known as recovery. During recovery all log records between a check-point start and a check-point end are analysed.

All committed transactions are rolled forward (ie: the log entries are compared with the pages and if the logged change has not been performed on the page then it is redone).

All rolled-back transactions (including those still in progress at the time of the EndCheckpointRecord) are rolled back (ie: the log entries are compared with the pages and if the logged change has been performed on the page then it is undone).

At the end of the recovery operation any transactions which were implicitly rolled back will have Rollback log records added.

Finally a checkpoint will be issued; this marks the end of the recovery phase, and the page-server is open for business.
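
The redo/undo tests above can be expressed with LSN comparisons - a sketch; LSNs and every name here are assumptions on my part:

```csharp
public abstract class LogRecord
{
    public long Lsn;       // position of this record in the log
    public long PrevLsn;   // LSN the page held before this change

    public abstract void Apply(RecoveryPage page);   // redo the change
    public abstract void Revert(RecoveryPage page);  // undo the change
}

public class RecoveryPage
{
    public long PageLsn;   // LSN of the last change present on the page
}

public static class Recovery
{
    // Roll forward a committed transaction's record: redo only when the
    // page does not yet contain the logged change.
    public static void Redo(LogRecord rec, RecoveryPage page)
    {
        if (page.PageLsn < rec.Lsn)
        {
            rec.Apply(page);
            page.PageLsn = rec.Lsn;
        }
    }

    // Roll back: undo only when the page does contain the logged change.
    public static void Undo(LogRecord rec, RecoveryPage page)
    {
        if (page.PageLsn >= rec.Lsn)
        {
            rec.Revert(page);
            page.PageLsn = rec.PrevLsn;
        }
    }
}
```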

Welcome to my Audio Database

Welcome to the Audio Database blog



The purpose of this blog is to highlight the pleasures and pains I encountered during the development of an RDBMS written entirely within the confines of the Microsoft .NET Framework.

The project started life six months ago while I was re-reading a copy of Inside MS SQL Server 6.5... The description of the internal server architecture proved too close to pseudo-code for me not to have a go at pulling together my own version of such a beast.

Futile though this task may seem, I do have an end goal! The application is designed to support streaming audio/video data and it is designed to handle multiple requests from multiple clients. It will also support the attachment of multiple audio streams to a given video clip.

The database itself supports full ACID properties and a fully recoverable transaction log which records page changes.

Databases can span multiple files and can be organised into sub-groups known as file-groups to increase performance by ensuring certain allocations are placed on a given group of devices.

The page engine manages an in-memory representation of the underlying physical data stores and keeps this consistent and synchronised by way of a transaction log and suitable locking mechanisms which ensure only a single transaction can ever update a given page at any given time.

As you can imagine this is no small undertaking!

The task list is daunting:

  1. Device class hierarchy
  2. Page class hierarchy
  3. Page cache
  4. Lazy writer
  5. File-groups
  6. Locking
  7. Transactions
  8. Logging
  9. Recovery
  10. Physical Page Allocation
  11. Logical Page Allocation
  12. Database Page Allocation
  13. Table/Sample/Video Index Managers
  14. Table Manager
  15. Sample Manager
  16. Video Manager


Those tasks are all needed to complete the database engine alone!

Over the past few months items 1 through 8, 10 and 11 have been completed and currently lie untested (gulp!), with item 9 (recovery) on the critical path.

Once the recovery implementation is complete I can start testing the paging and recovery logic - this will be extensive and will require, among other things, being able to disable the lazy writer to check the behaviour of the system during recovery scenarios.

After that the allocation logic can be finished off. Currently the device expansion is implemented, and the mechanism for allocating a new logical ID is there, together with a simple algorithm for picking the best device when expanding a file-group. The only logic missing from the allocation support is the updates needed for the distribution pages (which track database page allocations).

With allocation complete, work on the index managers can be finalised. Unlike SQL Server I will not be supporting clustered indices, although I do share the B-tree implementation - the test harness for proving the algorithm was fun to write, I can tell you...

With completed index managers I can finish the three object managers - easy!