Thursday, December 07, 2006

CCR + DSSP = Distributed Scalable Buffer Devices

After much fiddling around and moving of code, I've finally got to a point where I can test the implementation of a feature I call the distributed buffer device service. Before I go on to describe what this is, you might want to read about the CCR and DSSP sub-systems upon which all the service implementations are based...
Okay, so where was I? Ah yes - for the past two months I've been investigating whether it is possible (and desirable) to write parts of the database engine as DSSP services.
I had already converted the code-base to utilise the CCR framework rather than the difficult-to-code-and-debug Asynchronous Programming Model - a shame really, since I'd become rather good at writing those wrappers!
So this investigative process was really a continuation of existing work. The initial implementation of a Physical Buffer Service was simple enough and even compiled and built without too much hassle; however, the Container Buffer Service (the minimum service needed to perform useful testing) ran into some damnably obscure issues - all related to the generation of the proxy project code.
I have finally (after much hair pulling - very painful considering I've no hair on my head) got this Container Buffer Service to compile and the proxy service to build! Wow!
So what's the point of all this abstraction? Well, DSSP allows services to communicate with each other over HTTP, so each service need not live on a single machine. Since our services are now DSSP services, we automatically get distributed physical device services - we didn't have THAT before, so this must be considered "progress"...
Now the test harness for these services is actually an NT service - I call it the "Block File-System Service" and it could well form the underpinnings of the Audio Database service.
The implication of all this is that the database file-group devices will maintain a one-to-many relationship with FileSystem service instances running on potentially multiple machines - sounds super uber scalable to me...
Once I have Container Buffer Services working - it will be time to look at the caching version. Note the caching implementation will provide caching at both ends of the network connection to increase networking performance.

Tuesday, November 07, 2006

Concurrent Pains in my Brain

Long time no post means the new messiah that is encapsulated within the Microsoft Robotics Toolkit is proving a right devil to implement!

Right now it has caused the addition of four more projects to the overall solution and no end of changes to the code framework!

The most important change is the adoption of DSS (aka Decentralized Software Services), and all devices are being rewritten to take advantage of this concept. The basic idea is to encapsulate all messages between systems in SOAP messages. These messages can then use a unified transport mechanism to reach their destination, and with DSS that destination can be another machine with no further coding!

The first service to arrive from this happy relationship was the PhysicalBufferDevice service. This service is responsible for low-level reading and writing to and from an associated file using asynchronous I/O together with coordination of resizing operations.

The next one up is the ContainerBufferDevice service that deals with clusters of PhysicalBufferDevices.

Following that is the CachingBufferDevice service that not only deals with clusters of PhysicalBufferDevices like ContainerBufferDevice but also uses optimised buffer caching to increase node performance.
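The three-service layering described above is easy to picture as classes, each wrapping the next. Here is a language-agnostic sketch in Python (the real services are .NET/DSS classes; all class and method names here are illustrative, not the actual API):

```python
class PhysicalBufferDevice:
    """Low-level reads/writes against a single backing store (a dict here)."""
    def __init__(self):
        self._pages = {}

    def write_page(self, page_id, data):
        self._pages[page_id] = data

    def read_page(self, page_id):
        return self._pages[page_id]


class ContainerBufferDevice:
    """Routes a global page id to one of a cluster of physical devices."""
    def __init__(self, devices):
        self._devices = devices

    def _route(self, page_id):
        # Trivial routing for illustration: spread pages across devices by id.
        return self._devices[page_id % len(self._devices)]

    def write_page(self, page_id, data):
        self._route(page_id).write_page(page_id, data)

    def read_page(self, page_id):
        return self._route(page_id).read_page(page_id)


class CachingBufferDevice(ContainerBufferDevice):
    """Same routing, plus a read cache to boost node performance."""
    def __init__(self, devices):
        super().__init__(devices)
        self._cache = {}

    def read_page(self, page_id):
        if page_id not in self._cache:
            self._cache[page_id] = super().read_page(page_id)
        return self._cache[page_id]
```

The point of the layering is that callers only ever talk to the outermost device; where the page physically lives (and whether it came from cache) is invisible to them.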

Since DSS is being used, a new hosting environment was devised to ensure we can control how our DB services are started and, obviously, who has access to the service instances.

Still with me? Good! Well, all this is wrapped up in the Audio File-System NT service. Its purpose is to allow the upstream database core to distribute not only files but caching too across multiple machines - this will be extremely scalable and promises to have scope beyond the Audio DB project.

Now you can see why I've been too busy to post... Anyway, the NT service is complete and is undergoing final testing. Once that is done I will be able to do some proper stress testing and, assuming all goes as well as I expect (haha), continue with the next layer up. Something tells me there is another layer in front of what was the next layer - I will be needing a file-system unification layer for all those distributed file-system services...

Sunday, September 03, 2006

Concurrency Messiah

Well, it's a funny old game this programming lark, and every so often you come across something so ground-breaking that it quite simply takes your breath away. Today's breath-taking event concerns a new piece of pre-release software from those good old folk at Microsoft: a .NET toolkit known as the Robotics Studio. Despite having robots in mind, it comes with a fantastic framework for helping with multi-threaded applications - and this database is very threaded indeed...

From initial experiments I will be able to fully recode the BufferDevices and all the Locking Primitives to make use of this new technology and seriously reduce the complexity of the underlying software. Yes folks, it's another piece of reengineering ahead, and I think the 27 errors I have at the moment will slowly but surely expand before I get the codebase back under control - a shame really, as the table row persistence was almost finished too!!

However, this is a worthwhile exercise, as a fully thread-safe, maintainable piece of code is not an easy thing to achieve - but right now it is looking entirely possible! I am not looking forward to re-entering the world of locks - they were a nightmare the first, second and third time around!

Wednesday, August 30, 2006

Transaction Log Testing

It always pays to test your codebase regularly, and I have finally got to the point where I can test the transaction logging code - you'd be surprised at the amount of code I have had to write in order to get this far!
However, it has also been a pleasant surprise to find that most of the transaction log writing code worked without modification! The only part requiring work was the class responsible for providing a virtual file-system over a stream - this object is crucial for dividing the backing storage used for transaction logs into smaller chunks. It had a problem where it was reporting the stream position incorrectly, which led to plenty of seemingly unrelated problems...
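That position bug is a classic trap when layering virtual files over one shared stream: each virtual file must report its own logical position, never the backing stream's physical one. A minimal sketch of the idea in Python, with `io.BytesIO` standing in for the backing storage (all names are mine, not the actual classes):

```python
import io

class VirtualFile:
    """A fixed-size window (chunk) onto a shared backing stream."""
    def __init__(self, backing, start, length):
        self._backing = backing
        self._start = start      # physical offset of this chunk
        self._length = length
        self._pos = 0            # logical position, NOT the backing position

    def seek(self, pos):
        self._pos = pos

    def tell(self):
        # The bug described above: returning the backing stream's position
        # here instead of the logical one corrupts every caller's bookkeeping.
        return self._pos

    def write(self, data):
        if self._pos + len(data) > self._length:
            raise ValueError("write beyond end of virtual file")
        self._backing.seek(self._start + self._pos)
        self._backing.write(data)
        self._pos += len(data)

    def read(self, n):
        n = min(n, self._length - self._pos)
        self._backing.seek(self._start + self._pos)
        data = self._backing.read(n)
        self._pos += len(data)
        return data
```

Two virtual files can then interleave reads and writes over the same backing stream without stepping on each other's positions.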
So the writing of transaction log records is working, but the recovery process remains untested. In order to conclude that portion of the application I will need to get to a point where the cache writer is correctly saving logged pages - and that means retesting the buffer class state-machine, which until recently had a few too many states!

Fields, Properties and Serialisation

In a bid to ease the task of writing row and index information to a given page, I set out on a major mission (major due to the number of classes needing modification) to revise the use of explicit member fields and change these into objects that can not only serialise themselves to and from a suitable backing store but also support the concept of being locked.

Locked fields support both read and write of their associated values, whereas unlocked fields perform a dummy read of existing data instead of the corresponding write. This makes it easier to update a buffer when the extent information of a distribution page changes, without causing all distribution page extent changes to require an exclusive object lock. This alone will speed up concurrent updates but will need further coordination, plus additional LogEntry-derived classes to deal with the information actually written in the event of a rollback operation.
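The locked/unlocked field idea boils down to this: each field knows how to persist itself, but a field not locked for update skips its write and leaves the existing bytes alone. An illustrative Python sketch (the real classes are C#; the names and layout here are mine):

```python
class BufferField:
    """A field that serialises itself into a page buffer only when locked."""
    def __init__(self, offset, size, value=b""):
        self.offset = offset
        self.size = size
        self.value = value
        self.locked = False   # set when the owning transaction locks the field

    def persist(self, page):
        if self.locked:
            # Locked: write the new value into the page buffer.
            padded = self.value.ljust(self.size, b"\x00")
            page[self.offset:self.offset + self.size] = padded
        # Unlocked: the "dummy read" - leave existing bytes untouched, so
        # concurrent updates to other fields need no exclusive page lock.
```

Persisting a whole page then becomes a loop over its fields, with only the locked ones actually touching the buffer.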

As one can imagine, with over 100 classes and almost one thousand fields and properties to update, this was an onerous task to undertake...

The net result was a true simplification in the implementation of Page object persistence and as an added bonus the persistence of both table and index key information has become so simple I ended up removing classes - that is always a great feeling!

So after a good 50 hours of near-continuous programming I have finally got the codebase back to a state where it builds! It took another 12 hours to fix the variety of bugs and race-condition related problems before the creation of a database actually worked without problem!

No time to rest and relax - I also modified the LogEntry classes to incorporate the same mechanism for reading and writing themselves.

Sunday, August 06, 2006

Async Continues

Page splitting has at long last been converted to a fully asynchronous operation, and relatively painlessly too. I must be becoming something of an expert at writing implementations of IAsyncResult, as it seems to be getting easier and easier - although I must admit to wondering sometimes, with all these async wrappers all over the place, where the real thread is hiding doing the actual work. Scary but ever so true...
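For readers unfamiliar with the Begin/End pattern: an IAsyncResult implementation pairs a Begin call that kicks work off on a worker thread with an End call that blocks until completion and surfaces the result (or exception). A rough Python analogue of what such a wrapper does under the hood (names are mine; this is a sketch, not the .NET API):

```python
import threading

class AsyncResult:
    """Minimal analogue of .NET's IAsyncResult: a Begin* method returns this,
    and the matching End* method blocks on it, re-raising any captured error."""
    def __init__(self):
        self._done = threading.Event()
        self._result = None
        self._error = None

    def complete(self, result=None, error=None):
        self._result, self._error = result, error
        self._done.set()

    def end(self):
        self._done.wait()          # block until the worker finishes
        if self._error:
            raise self._error
        return self._result


def begin_read(data, offset, count):
    """Illustrative Begin* method: slices `data` on a worker thread."""
    ar = AsyncResult()

    def worker():
        try:
            ar.complete(result=data[offset:offset + count])
        except Exception as exc:
            ar.complete(error=exc)

    threading.Thread(target=worker).start()
    return ar
```

Stacking many such wrappers is exactly why the "where is the real thread?" question gets hard to answer from a stack trace.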

It's easier doing this stuff than doing the "real" work I'm dreading - finishing off the index manager and table manager... What's worse is that I'm almost certain I'll need a final wrapper that sits on top of the Table Index Manager and the Table Page Manager and coordinates the actions of both.

It never ceases to amaze me, the sheer number of layers and wrappers this project seems to be creating - I once joked it was like peeling an onion, but in actual fact it is more like building an onion!

Saturday, August 05, 2006

Magic Tables, Constraints and Columns

It's been frantic - as the low-level engine edges ever closer to completion my attention has been squarely focused on that mainstay of RDBMS systems known affectionately as tables!

As it happens the table implementation has not proved too difficult thus far, and currently support is in place for the following:


  • Overflow table definition pages

  • Column constraints

  • Data page-splitting



The row block writer and row block reader objects have an initial implementation and the internal row organisation is finished!

The work required to finish the row block persistence can only continue when the Table Index Page/Table Index Manager logic has been completed. This will take a bit longer now that an implementation of clustered indices is required plus the Index Delete/Index Page Combine operations have yet to be written - yet more work!

This work needs to be synchronised with ongoing work on the Table Index Manager to ensure we can add/update/delete rows along with appropriate index trees - yippee!

An old column object has been extended to provide serialisation support to both the RowReader and RowWriter classes - as one might have expected - to centralise persistence logic and column capabilities. Hence when I get around to supporting User Defined Types or whatever, the rest of the code should simply carry on working - famous last words, I know...

Saturday, May 13, 2006

Lock Testing

Been busy testing the generic locking class and implementing the specialised derivatives that establish the lock hierarchy - and so far it all works. Nice!
An added bonus has been getting the Visual Studio Team System testing framework going again, as that allows stress testing and suchlike madness to be initiated.

Incidentally, it also occurred to me that the locking primitives used to get page-level locks could be written to support asynchronous operation (they are fully synchronous at the moment), which might just eke a bit more performance out of them - though I'm not sure I want to go through the pain.
I need to reimplement the testers for the buffer devices, as actually testing the transactioning of a page may well take some time - I dunno, I shall see.

The lock manager and the associated lock-owner-block/transaction-owner-block objects have been rejigged - they are easier to implement now that I'm thinking in terms of how they will be used rather than as the next layer out from the page/buffer implementation! That also made the page logic easier to implement, so maybe there is something to learn here...

As ever there is no rest for the wicked and I've turned my attention to tables. The row reader needs careful consideration as this code will need to read both from the table pages themselves and from a result-set defined on search results. I need to think about that some more...

Friday, May 12, 2006

ACID Fundamentals - Locks

I'm wading around in the guts of the transaction locking code and it is nothing short of a nightmare! I have a number of generic classes which do all the hard work; these are then specialised by final classes for each lock type which deal with the specifics. These "specifics" amount to handling state transitions and determining compatible lock types from different transactions - even the state class is defined within the generic implementation. Very clean and somewhat tidy.

Right now I cannot decide whether the escalation behaviour needs a way of being plugged into this generic implementation, and I still don't know where to put contextual information such as "whether to hold a read lock until the end of a transaction", as it can't stay in the DatabasePage object...

While I investigate escalation options, I have split the Page lock into separate sub-types:


  • Database Locks

  • Root Page Locks

  • Object Locks

  • Distribution Page Locks

  • Extent Locks

  • Schema Locks

  • Page Locks



Each of these locks has slightly differing state logic and this is the cleanest way of dealing with that.
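The "generic base plus per-type state logic" split can be sketched as a generic lock parameterised by a compatibility table, with each lock sub-type supplying its own table. This is an illustrative Python sketch, not the actual C# generics (the shared/exclusive matrix shown is my example for page locks):

```python
class GenericLock:
    """Generic lock: the per-type state logic arrives as a compatibility
    table mapping (held_mode, requested_mode) -> can-coexist."""
    def __init__(self, compatibility):
        self._compat = compatibility
        self._held = []            # list of (txn_id, mode)

    def acquire(self, txn_id, mode):
        for _, held_mode in self._held:
            if not self._compat.get((held_mode, mode), False):
                return False       # a real lock would queue/block here
        self._held.append((txn_id, mode))
        return True

    def release(self, txn_id):
        self._held = [h for h in self._held if h[0] != txn_id]


# Page locks: classic shared/exclusive state logic.
PAGE_COMPAT = {("S", "S"): True, ("S", "X"): False,
               ("X", "S"): False, ("X", "X"): False}
```

Each sub-type (database, schema, extent, ...) would supply its own, slightly different, table - which is exactly the "slightly differing state logic" mentioned above.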

I've added a collection object to the transaction context which allows it to track the objects holding outstanding locks during the lifespan of a transaction. This is very important when using ReadCommitted (with hold-lock) isolation and above, as these locks will only be released after the commit/rollback has happened - hence the best place for that is the transaction context, not the lock manager!

Weekend should involve more major work on this project - it's Friday already and I should be sleeping!

Tuesday, May 09, 2006

Async Wrappers

The implementation is once again making more and more sense - proof of my own self-delusional state, or perhaps proof that the project and the design are going in the right direction! I'm now fleshing out the data device initialisation/mounting code, and that is proving not too strenuous. I need to get the root page and distribution page init code sorted - then I'll be able to see the log-writer do its work. I can't wait!

I was under the mistaken impression that these updates to promote asynchronous behaviour and remove swathes of class hierarchy were going to make the app a dash simpler, but as I found out, the stack trace during writes is actually longer than before - lots and lots of async wrapper objects are the root of the issue. I may need to allocate these wrappers from a pool to keep the .NET memory manager happy, but then again this is what .NET is all about, so I'll just flag it for now!

I've taken a brief look at the index manager implementation - which still looks rather slick with its generics all over the place - and crossed another TODO off the list... I created an initial implementation of B-Trees operating over pages ages ago, but the code was totally synchronous. I realised it needed some careful rework to get it working efficiently with the BeginLoadPage/EndLoadPage APIs that have cropped up following the async conversions, and today I can happily say I have solved it with some of the scariest code I've ever written!! Not scary for its complexity - it is some of the most elegant, encapsulated OO code you'll ever meet - no, what scared me was the fact that when I started I didn't actually think the task at hand was entirely possible. Writing the B-Tree handler in the first place was nothing short of pain and misery...
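To give a feel for why this rework is scary: a synchronous B-tree descent is a simple loop, but once every page load goes through a Begin/End pair, each step of the descent has to resume inside a callback. A toy sketch of that continuation-passing shape (the page layout, names, and the `begin_load` API are all illustrative; the loads here complete synchronously for simplicity):

```python
class BTreeSearcher:
    """Descends a B-tree where every page must be fetched via an
    async-style begin_load(page_id, callback) rather than a direct read."""
    def __init__(self, begin_load):
        self._begin_load = begin_load

    def find(self, root_id, key, on_found):
        def step(page):
            if page["leaf"]:
                on_found(page["values"].get(key))
            else:
                # Pick the child whose separator covers the key, then
                # schedule the next load with this same continuation.
                for sep, child in page["children"]:
                    if key <= sep:
                        self._begin_load(child, step)
                        return
        self._begin_load(root_id, step)
```

The `step` continuation replaces the loop body; in a truly asynchronous device it would be invoked from an I/O completion rather than inline.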

Worse yet I still need to provide the B-tree implementation for the table index manager - similar but with the added complication of defining a class hierarchy for dealing with the different data-types I plan to support and the obvious headaches involved in doing string comparisons... I have never understood the various collations - ever...

Anyways I'm happy - my writing of asynchronous code has come of age - almost to the point where I am considering writing an article on just that! Watch this space for a URL...

I need to revisit the locking implementation, as it currently needs too much information, some of which is not present until the page is loaded - a situation I am keen to avoid...

Database Devices Cracked

Wahey! Good news!

Page level functionality is now tested and working. This includes the behaviour of the CheckPointer and the Log Writer although in the case of the latter the recovery process remains untested and uncharted waters!

Extensive debugging has been possible since the installer classes were rewritten to encompass the new class framework (the test harness utilising the installers already existed), and now the Buffer state-machine has been tested alongside the asynchronous behaviour exposed by just about every buffer/page class available!

It's not been plain sailing though - the implementation of the Free Buffer Service had to ensure data page device buffers were marked transactional and log buffers were not... The Buffer state-machine pattern also unearthed a number of peculiarities which caused some of the state-switching logic to move, and I also found out that the NestedContainer object does not forward GetService calls to the owner component - so I had to write one that did in order to get my own proprietary routing chain to work!

Now that the Log Writer appears to be working - at least in an initial capacity (I've just had to change the default log-page block size - it was far too large) - my attention turns to initialisation of the primary data device and the setup of file-group primary devices, which share more than a few attributes.

After that the real hard work begins!

Saturday, May 06, 2006

Buffer Internals

Finally the asynchronous support has been completed. :-D and as a result the BufferDevice hierarchy is far simpler and the PageDevice hierarchy is also far flatter.

So now I have two class hierarchies - one deals with buffers and can be considered the low-level API, and the other deals with pages and can be considered the next level up the chain. The buffer handlers were surprisingly easy to write - more a question of moving code from various other classes, which made a nice change.

The page classes were also surprisingly straightforward especially given the fact that I chopped lots and lots of classes out!!

The real pain came when I decided to unravel the state machine for buffer objects - this turned into a week-long adventure but the result is an incredibly flexible finite-state-machine which ensures consistent buffer state transitions and proper state handling without littering the Buffer class with lots of boolean flags! This task was necessary in order to make the object fully asynchronous...

The buffer can now support the following async operations:

  • Read From Device Stream

  • Write To Device Stream

  • Write To Log Device
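A finite-state machine like the one described above boils down to a transition table that rejects illegal state changes, replacing a pile of boolean flags. Here is an illustrative sketch - the states and legal transitions shown are my invention, not the actual Buffer states:

```python
class BufferStateMachine:
    """Allows only legal buffer state transitions; illegal ones raise."""
    TRANSITIONS = {
        "Free":       {"Allocated"},
        "Allocated":  {"Loading", "Writing"},
        "Loading":    {"Loaded"},
        "Loaded":     {"Dirty", "Free"},
        "Dirty":      {"LogWriting"},
        "LogWriting": {"Writing"},   # log must be flushed before the data write
        "Writing":    {"Loaded"},
    }

    def __init__(self):
        self.state = "Free"

    def transition(self, new_state):
        if new_state not in self.TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Because every state change funnels through `transition`, the consistency guarantee lives in one place instead of being scattered across flag checks.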



With this support in buffers and their corresponding devices complete, attention now turns to pages and their devices. There are a couple of loose ends which still need to be tied up with regard to hand off and lock acquisition - I also need to ensure the transaction handler will correctly unlock pages during the commit phase...

Finally the outer DatabaseDevice device can be completed with regard to recovery and the final mounting procedure before I once again fix the installer classes (and use them to test the initial portion of the codebase).

Sunday, April 16, 2006

Asynchronous Persistence

A major change is underway to clean up the implementation of asynchronous persistence used throughout the database classes.

It seemed to me that there were far too many methods and more than a little confusion in the implementation, so I've simplified the arrangement of devices with respect to loading, saving and initialising DeviceBuffer objects.

These simplified BufferDevice objects will handle the low-level buffer load/save/init operations and be wrapped by a single class used to handle page-level processing.

This arrangement will be easier to test and have better performance due to a significant reduction in the number of method calls...

It means revising an awful lot of code which is more than a small pain but one well worth taking on!

Monday, April 10, 2006

Transaction Logging

Installation testing is going very well, and more of the various subsystems are undergoing functional testing now. The maze of tasks involved in installing hierarchical devices has largely been solved, so attention has now turned towards creating .NET transactions to wrap the overall install and getting the custom transaction implementation to enlist itself into this framework feature.

Well, the enlistment part was fairly straightforward, with the designed classes needing only minor modifications in order to start working; however, the changes needed to handle saving transacted pages involved a little more head-scratching and code-tweaking! The freshly written code was saving pages and their associated buffers directly to the underlying device - clearly illegal if you want a recoverable system!

Ultimately, the code was modified for transacted buffers such that calls to SavePage update the database transaction holder with information about the buffer and the current timestamp. During the Commit-Preparation phase these buffers can be committed (scratchpad data moved to the write-pending area) and the transaction log records generated from the two images (or a single image in the case of newly initialised buffers). Finally, in the Commit phase, the Commit log record is written to validate the log records.
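The prepare/commit split can be sketched as follows: prepare stages each dirty buffer's scratchpad data as write-pending and builds a log record from the before and after images; commit then appends the commit record that makes everything before it valid. This is an illustrative Python sketch of the shape, not the actual classes:

```python
class TransactedBuffer:
    def __init__(self, page_id, before):
        self.page_id = page_id
        self.before = before    # image as loaded (None for a new buffer)
        self.scratch = before   # the transaction's working copy
        self.pending = None     # write-pending image, set during prepare


def commit(buffers, log):
    # Phase 1 - prepare: stage images and generate log records.
    for buf in buffers:
        buf.pending = buf.scratch
        if buf.before is None:
            # Newly initialised buffer: only a single (after) image exists.
            log.append(("INIT", buf.page_id, buf.pending))
        else:
            log.append(("UPDATE", buf.page_id, buf.before, buf.pending))
    # Phase 2 - commit: the commit record validates everything before it.
    log.append(("COMMIT",))
```

If the process dies before the `COMMIT` record hits the log, recovery can discard the earlier records - which is exactly why writing straight to the device was "clearly illegal".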

That's the idea at least - so far the distribution pages are following this regime fine but root pages seem to have a mind of their own - it's that or I'm not actually saving them...

All this work meant I needed to provide an implementation of the CheckPoint handler at long last, and thankfully this has proved relatively easy. The only real problem seems to be with the cache management threads, which don't seem to be unlocking the cache in all scenarios - still, at least I know where the problem is. Multithreaded mayhem can be a pain to debug, but .NET gives us flexible tracing!

Development should move into overdrive following the arrival of a new desk and chair for comfortable programming however the country-wide mayhem that is Songkran now lies directly in my path so it's all on hold for the next 5 days or so - fun fun fun (with a water gun!)

Sunday, April 09, 2006

Installers Reach Runnable Stage

Been working like a demon despite being on holiday from my holiday in Hanoi!!!

The installation components are now being rigorously tested by a new test-harness, and as a result the call sequence and the mount operations performed by devices have been debugged and tweaked so that the installation now completes without problem.

It is still not a complete success - I need to check that the log-writer is logging the writes to the root pages, and that the data device root pages are correct. After all that is done I will be able to test the non-create mount operation and move delicately on to testing the recovery logic, which could be quite painful!

Found that VS2005 disconnected check-outs from Source Control work like a dream and my edits are checked in successfully and with a minimum of fuss (well once it worked out the network drive was back online that is...)


Friday, March 31, 2006

Log Device and File-Group Device Installation

It would appear that log devices and file-group devices have more than a few things in common - not quite enough to justify deriving log devices from file-group devices, but it's certainly close. The installers for both these classes also share a certain amount of commonality, but this is thankfully limited to ensuring each installer has a primary physical device among its children (and only one at that), so no great problem!

I probably shouldn't have said that - I might yet need to modify the searching code to look through the whole child tree rather than limiting the search to immediate children, as I may want to wrap physical devices within a transaction installer wrapper, for example.

I will leave the wrapping of installers in transactional installers to cover installing the log device and each database file-group, as these are autonomous operations (perhaps even wrapping the log device and primary file-group under one transaction and everything else under another).

As you can see - I've not quite decided on the right way to go at this time!

Before finishing work on the installers I need to finish the implementation of performance counters and satisfy myself that I have enough support for event-logging within the current logging object - then the project installer located in the DB class library can be moved into the Installer library!

No rest for the very very very wicked eh?

Sunday, March 26, 2006

Installer Work

More work on the installation/initialisation aspects of the DB engine: the database physical device installer now correctly writes the distribution pages to disk following successful installation of the underlying physical file.

The Device installer base class has been tweaked so that it fits into the .NET framework installer system better, and so far it looks good.

With the physical data device installers approaching completion - only the writing of the actual root page remains to be done there - I'll move on to the file-group device installer. This installer class needs to update the root page of each device so as to identify the other members of the file-group. This will make attaching a file-group easier for users, as they will not have to find the primary file.

Next up will be the log device installer, and finally I can look at the last piece of the puzzle - the database installer itself!

Saturday, March 25, 2006

Refactoring Revisited & Installer Heaven

Well I have been a busy bee indeed as programming operations have moved from the United Kingdom to Thailand - the weather is warmer, the cost of living cheaper and since the sun shines more often - I tend to feel more inclined to write code and good code at that!

Reorganised the class library for what should be the last time - it now follows a similar convention to .NET itself, so all generic classes are in a root namespace and base device and page classes are grouped together in a componentmodel namespace.

This has helped no end, significantly reducing the number of namespace entries within each code file.

The component libraries are now signed with a strong name and I have also created a reference class which is shared via source control to each of the DB assemblies - this assembly reference contains strings which simplify the task of referencing other assemblies in type names and so on.

I have also started work on installer classes to facilitate the installation/uninstallation of database device components (rather handy that they are components, eh?). This work has changed the focus to how these devices are installed and where the responsibilities lie in dealing with scope of work and the definition of pages.

A few things have fallen out of the installer work - I have a base DeviceInstaller class (derived from ComponentInstaller) from which a hierarchy of installer classes is derived.
Installing a device involves execution of the following:

  • Creation of installation-time device component

  • Attaching to parent device (as necessary)

  • Pre-mounting tasks

  • Create/mount of device

  • Post-mounting tasks



The logic controlling the sequencing of these operations is encapsulated within the DeviceInstaller base class.
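That sequencing is a textbook template method: the base class fixes the order, and derived installers fill in the steps. A hedged sketch in Python (stand-ins for the C# classes; the method names are mine):

```python
class DeviceInstaller:
    """Base installer: fixes the install sequence, defers the steps."""
    def install(self, log):
        self.create_device(log)      # installation-time device component
        self.attach_to_parent(log)   # as necessary
        self.pre_mount(log)          # static settings: ID, Name, etc.
        self.mount(log)              # create/mount of device
        self.post_mount(log)         # device pages, e.g. distribution pages

    # Default (overridable) steps.
    def create_device(self, log): log.append("create")
    def attach_to_parent(self, log): log.append("attach")
    def pre_mount(self, log): log.append("pre-mount")
    def mount(self, log): log.append("mount")
    def post_mount(self, log): log.append("post-mount")


class PhysicalDeviceInstaller(DeviceInstaller):
    """Derived installer overriding only the step it cares about."""
    def post_mount(self, log):
        log.append("post-mount: write root + distribution pages")
```

The pay-off is that no derived installer can get the ordering wrong - the sequence lives once, in the base class.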

Pre-mounting tasks involve setting up the static device settings. The ID and Name are handled by the base class with everything else being handled by the appropriate derived class.

Post-mounting tasks involve the setting up of device pages as determined by the installer. DatabasePhysicalDeviceInstaller will create distribution pages here and update the root page information (as created by the PhysicalDeviceInstaller)

Certain installer classes will need to record state information to facilitate rollback or uninstall operations.

Wednesday, February 01, 2006

Lock Manager

The lock object pattern has been extracted and all page locking primitives have been coded!

Now I'm looking at lock owner block objects, which track locks per object, and transaction lock owner block objects, which track lock owner blocks for a given transaction.

LOBs are used to track pages locked for a given object and determine when the lock manager will escalate a given lock to a full parental lock.

A transaction may consist of several LOBs and these are tracked within a TLOB.
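The escalation trigger described here is essentially a counter: a lock owner block counts the page locks held against one object and, past some threshold, trades them all for a single parental (object-level) lock. An illustrative sketch - the threshold and all names are mine:

```python
ESCALATION_THRESHOLD = 3   # illustrative; a real engine tunes this carefully

class LockOwnerBlock:
    """Tracks page locks for one object and escalates past a threshold."""
    def __init__(self, object_id):
        self.object_id = object_id
        self.page_locks = set()
        self.escalated = False   # True once a full object lock replaces them

    def lock_page(self, page_id):
        if self.escalated:
            return               # the object lock already covers every page
        self.page_locks.add(page_id)
        if len(self.page_locks) > ESCALATION_THRESHOLD:
            # Trade many page locks for one parental lock.
            self.page_locks.clear()
            self.escalated = True
```

A transaction's TLOB would then simply hold one such LOB per object it has touched.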

I still need to nail down the full lock hierarchy and once this activity has been performed I will be able to finish off the lock manager and implement the lock escalation logic too!

Monday, January 23, 2006

Got more Locks than Chubb!

Currently waist-deep in locking entrails, having revisited the locking implementation code for the fourth time!

I have now developed a generic locking class which uses a state machine to control the behaviour of the locking object. Each concrete specialisation provides the nested state classes used by the generic lock class. Currently the generic lock has concrete specialisations for the page locking implementation and for schema locking.

Schema locks (in case you were wondering) are used to control updates to the table schema or sample definition block. This ensures that query builders can lock the schema (read access) and build their query before getting a suitable lock on the table and releasing the schema lock. To modify the schema, a connection will have to wait until all readers have released and all bulk updates have completed. At this point the schema-exclusive lock is granted, and an exclusive table lock can be acquired to facilitate the update of the table rows (or, in the case of a sample, the sample data).

Buoyed by this coding advance I started simplifying the logic inside the Lock Manager class by creating a generic lock handler class. This class knows how to maintain both a hashtable lookup for active locks as well as a free lock queue which is populated as active locks are released. So far so good, and it all works; however, I have been unable to integrate the RLock (aka Resource Lock) into this scheme. I will probably end up writing a concrete version of the lock handler specifically for RLocks - if only to ensure that the free lock queue thread (yet to be written, but it ensures the free lock queue is populated ready for action) can be implemented in a way that ensures all these handlers expose a common interface - say ILockHandler for example.
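The active-hashtable-plus-free-queue idea can be sketched in a few lines of Python (the names are illustrative, not the project's actual API):

```python
from collections import deque

# Sketch of a generic lock handler: active locks live in a hash table
# keyed by resource ID; released locks are recycled through a free queue
# so the handler rarely needs to allocate a fresh lock object.
class Lock:
    def __init__(self):
        self.resource_id = None

class LockHandler:
    def __init__(self, preallocate=1):
        self.active = {}                                 # resource_id -> Lock
        self.free = deque(Lock() for _ in range(preallocate))

    def get_lock(self, resource_id):
        lock = self.active.get(resource_id)
        if lock is None:
            lock = self.free.popleft() if self.free else Lock()
            lock.resource_id = resource_id
            self.active[resource_id] = lock
        return lock

    def release_lock(self, resource_id):
        lock = self.active.pop(resource_id)
        lock.resource_id = None
        self.free.append(lock)                           # recycle for reuse

handler = LockHandler()
a = handler.get_lock("page:42")
handler.release_lock("page:42")
b = handler.get_lock("page:99")
print(a is b)   # True - the released lock object was recycled
```

The background thread mentioned above would simply top up `free` whenever it runs low, so `get_lock` almost never allocates on the hot path.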

When that is complete there'll be no time for larking about - I need to write the Lock Owner block code which is used to determine when (and how) lock-escalation occurs.

I feel tired already!

Monday, January 09, 2006

Locking Revisited

Isn't it strange how when you think you have something sussed it turns around and bites you in the most unexpected manner... This project is no stranger to the world of surprises, and this time it was the turn of the current page lock implementation...


Lock that Devil Down

The more I looked at the code I'd written, the more it seemed as if I'd completely buggered up the implementation of intent-locks. On closer inspection it was worse than that and highlighted my early misunderstandings regarding the transaction locking classes. I was trying to treat the different lock levels (hierarchically speaking) as almost independent - except that they are not independent! I also tried to create a transaction-based reader/updater/writer lock - this works as far as it goes, but the attempt to use two of these locks to achieve a hierarchical page locking mechanism was optimistic!


Better the Devil...

The revised lock object (which for the moment I've called PageLockEx) now handles all locking within the class itself... It is also a finite state machine which is nice...
The locking modes supported are:


  • Shared [S]

  • Update [U]

  • Exclusive [X]

  • Intent Shared [IS]

  • Intent Exclusive [IX]

  • Shared with Intent Exclusive [SIX]


The intent locks are used to establish a lock hierarchy chain and speed the operation of locks, as high-level locks do not need to enumerate descendant locks before deciding whether to grant a requested lock - nice! Makes the implementation a right dog's breakfast though - not so nice!
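The "no enumeration needed" point falls out of a simple compatibility matrix: the holder of a table-level lock only compares its own mode against the requested mode. Here is a Python sketch for the six modes listed above - the table follows the usual multi-granularity locking rules, so treat it as illustrative rather than this project's exact matrix:

```python
# COMPAT[held][requested] -> can the requested mode be granted alongside
# the held mode? Standard multi-granularity table for S/U/X/IS/IX/SIX.
COMPAT = {
    "S":   {"S": True,  "U": True,  "X": False, "IS": True,  "IX": False, "SIX": False},
    "U":   {"S": True,  "U": False, "X": False, "IS": True,  "IX": False, "SIX": False},
    "X":   {"S": False, "U": False, "X": False, "IS": False, "IX": False, "SIX": False},
    "IS":  {"S": True,  "U": True,  "X": False, "IS": True,  "IX": True,  "SIX": True},
    "IX":  {"S": False, "U": False, "X": False, "IS": True,  "IX": True,  "SIX": False},
    "SIX": {"S": False, "U": False, "X": False, "IS": True,  "IX": False, "SIX": False},
}

def compatible(held, requested):
    """Grant decision made purely from the two modes - no walking of
    descendant page locks is required."""
    return COMPAT[held][requested]

print(compatible("IS", "IX"))   # True  - intent modes mostly coexist
print(compatible("SIX", "IX"))  # False - SIX already carries exclusive intent
```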

The first draft did make me realise one thing though - lock escalation isn't blind and neither is lock parent propagation. So I'll get that bit right this time round (well I better had - I've never had to sack myself before and don't fancy starting anytime soon...) The other thing I missed was an implementation of three more locks (Schema Update locks, Schema Stability locks and Bulk Update locks) and these will be catered for using bolt-on classes. These classes are only ever bolted to locking primitives which relate to table objects.

The only other question left to ask is how on earth this page locking object is going to be tested? Back to the mumbles and half-murmurs until I've thought some more about test plans and so on...!

Once I've got this hammered into the shape I wanted it in the first place I'll have to revisit the allocation logic, which I was hoping would be a nice simple implementation - however, since I've gone and created super-flexible devices it seems like nothing but the complex solution will do! Anyway - that can wait for another day!

Thursday, January 05, 2006

Database Allocation

Work on finishing the database allocation process is getting closer to completion.

Extents - An extent is a contiguous block of eight pages.

Mixed Extents - These are extents which contain pages owned by multiple objects.

Uniform Extents - These are extents which contain pages owned by a single object.

This part of the logic deals with updating distribution pages and providing an extent management solution. Currently the extents are managed within the distribution pages themselves however this may need to be moved to ease the task of determining when objects switch from mixed-extents to uniform extents.

Once the extent management is finalised, locking can be added to ensure only a single transaction can modify a given extent at any one time.

To modify an extent you will need an exclusive lock on the extent and a shared intent lock on the distribution page in which that extent lies.

When a distribution page becomes full, a new distribution page must be used or allocated (513 pages from the last) and the process repeated.
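To make the 513-page spacing and the eight-page extents concrete, here is a small Python sketch of locating the distribution page and extent for a given page (the helper names are hypothetical; the constants come straight from the definitions above):

```python
PAGES_PER_EXTENT = 8
PAGES_PER_DIST_REGION = 513                        # 1 distribution page + 512 data pages
EXTENTS_PER_DIST_PAGE = 512 // PAGES_PER_EXTENT    # 64 extents per region

def distribution_page_for(page_id):
    """Physical ID of the distribution page covering page_id."""
    return (page_id // PAGES_PER_DIST_REGION) * PAGES_PER_DIST_REGION

def extent_index(page_id):
    """Index (within its region) of the extent page_id falls into."""
    offset = page_id % PAGES_PER_DIST_REGION - 1   # skip the distribution page itself
    return offset // PAGES_PER_EXTENT

print(distribution_page_for(0))     # 0   - the first distribution page
print(distribution_page_for(600))   # 513 - page 600 is in the second region
print(extent_index(9))              # 1   - pages 1-8 form extent 0, page 9 starts extent 1
```

Taking an exclusive lock on an extent plus a shared intent lock on its distribution page, as described above, then only requires these two computed IDs.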

Thankfully the device objects already know how to change size, and container devices know how to pick the most suitable device for expansion, so once a device has been selected for expansion (and indeed expanded) the final work left to do is in preparing the pages. When shrinking a device the same operations need to be done but in reverse order...

It remains to be seen as to whether it is a good idea to checkpoint after device allocations but the system will certainly checkpoint after adding or removing devices!

Tuesday, January 03, 2006

Page Buffers and Idents

A Page object is the logical view of a page; however, to abstract the page implementation from the persistence logic the concept of Page Buffers was created.

Page Buffers are the device-level view of pages. These objects track where they point in terms of both a virtual ID and a logical ID.

Virtual Page ID - Contains the unique device ID (unique to a given database that is) and the physical page ID (which ranges from 0 to the number of pages allocated on the device).

Logical Page ID - Provides an abstraction for the virtual page ID to ease the task of moving pages.
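A Python sketch of the two identities (hypothetical names): the logical map is the only thing that changes when a page moves, so every caller holding the logical ID is unaffected.

```python
from dataclasses import dataclass

# Sketch of the two page identities described above.
@dataclass(frozen=True)
class VirtualPageId:
    device_id: int       # unique within a given database
    physical_page: int   # 0 .. number of pages allocated on the device

# Logical IDs indirect through a map, so a page can move devices
# without its logical ID ever changing.
logical_map = {}         # logical_id -> VirtualPageId

logical_map[1001] = VirtualPageId(device_id=1, physical_page=42)

# Moving the page: only the mapping is updated; callers keep using 1001.
logical_map[1001] = VirtualPageId(device_id=2, physical_page=7)
print(logical_map[1001].device_id)   # 2
```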

The page cache (which incidentally is implemented as a device class) also contains the implementation of the Free Buffer manager (which ensures we always have a ready supply of buffer objects when we need them), the Read Manager (which will be expanded to incorporate the Read Ahead manager), and the Lazy Writer. The Lazy-Writer also deals with check-pointing - under these conditions extra threads are spawned to write all dirty cache pages to disk as quickly as possible.

Q. So What Is an Audio Database?

A. Simply put, the Audio Database is essentially built on top of a "page" server.

Q. Well that's simple... but what exactly is a Page Server?

A. Here's where it gets interesting...

Page Server


A page is simply a block of memory - currently fixed in size at 8192 bytes (8 KB).
The idea is that each database file is divided into pages and the page server maintains a memory cache of loaded/modified pages.

To facilitate recovery, if a page is changed then the changes must be logged to the transaction log before the page itself is written to disk (lazy write).

Once the page change has been logged we are free to save the actual data page at any time we choose (since the change itself is logged and can be recovered).
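That write-ahead ordering is the whole trick, and it fits in a few lines of Python (a toy model with hypothetical names, not the engine's actual API):

```python
# Sketch of the write-ahead rule: a page change is appended to the log
# BEFORE the dirty page may be lazily written to disk.
log = []       # the transaction log (append-only, durable first)
disk = {}      # page_id -> page contents on disk
cache = {}     # page_id -> dirty page contents in memory

def change_page(page_id, data):
    log.append((page_id, data))   # 1. log the change first
    cache[page_id] = data         # 2. then dirty the cached page

def lazy_write(page_id):
    # Safe at any time we choose: the change is already durable in the log,
    # so a crash before this point can be recovered by replaying the log.
    disk[page_id] = cache[page_id]

change_page(7, b"new row data")
lazy_write(7)
print(disk[7])   # b'new row data'
```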

Forced Saves
To ensure consistency the page server will occasionally perform a forced save of all cached pages - this is known as a checkpoint operation. The checkpoint is very important as it defines the last known recovery point when the page server is restarted.

Still with me? Good!

Recovery
Clearly when a database is brought online there are some extra recovery steps which must be performed before the data store is synchronised with the logged transactions.

This process is known as recovery. During recovery all log records between a checkpoint start and a checkpoint end are analysed.

All committed transactions are rolled forward (ie: the log entries are compared with the pages and if the logged change has not been performed on the page then it is redone).

All rolled-back transactions (including those still in progress at the time of the EndCheckpointRecord) are rolled back (ie: the log entries are compared with the pages and if the logged change has been performed on the page then it is undone).

At the end of the recovery operation any transactions which were implicitly rolled back will have Rollback log records added.

Finally a checkpoint will be issued and this marks the end of the recovery phase and the page-server is open for business.
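The redo/undo pass described above can be sketched as a single walk over the checkpoint window (a toy Python model; a real pass would apply undo in reverse log order and write compensation records, as noted above):

```python
# Sketch of the recovery pass. Each log record notes which page it
# touched, its transaction, and whether the change made it to the page.
def recover(log_records, committed_txns):
    redone, undone = [], []
    for rec in log_records:                       # analyse the checkpoint window
        if rec["txn"] in committed_txns:
            if not rec["applied"]:                # roll forward: redo what's missing
                rec["applied"] = True
                redone.append(rec["page"])
        else:
            if rec["applied"]:                    # roll back: undo what was applied
                rec["applied"] = False
                undone.append(rec["page"])
    return redone, undone

log_records = [
    {"txn": "T1", "page": 3, "applied": False},   # committed, change not yet on page
    {"txn": "T2", "page": 5, "applied": True},    # still in flight at checkpoint end
]
print(recover(log_records, committed_txns={"T1"}))
# ([3], [5]) - T1's change is redone, T2's change is undone
```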

Welcome to my Audio Database

Welcome to the Audio Database blog



The purpose of this blog is to highlight the pleasures and pains I encountered during the development of a RDBMS system written entirely within the confines of the Microsoft .NET Framework.

The project started life six months ago while I was re-reading a copy of Inside MS SQL Server 6.5... The description of the internal server architecture proved too close to pseudo-code for me not to have a go at pulling together my own version of such a beast.

Futile though this task may seem, I do have an end goal! The application is designed to support streaming audio/video data and it is designed to handle multiple requests from multiple clients. It will also support the attachment of multiple audio streams to a given video clip.

The database itself supports full ACID properties and a fully recoverable transaction log which records page changes.

Databases can span multiple files and can be organised into sub-groups known as file-groups to increase performance by ensuring certain allocations are placed on a given group of devices.

The page engine manages an in-memory representation of the underlying physical data stores and keeps this coherent and synchronised by way of a transaction log and suitable locking mechanisms which ensure only a single transaction can ever update a given page at any given time.

As you can imagine this is no small undertaking!

The task list is daunting:

  1. Device class hierarchy
  2. Page class hierarchy
  3. Page cache
  4. Lazy writer
  5. File-groups
  6. Locking
  7. Transactions
  8. Logging
  9. Recovery
  10. Physical Page Allocation
  11. Logical Page Allocation
  12. Database Page Allocation
  13. Table/Sample/Video Index Managers
  14. Table Manager
  15. Sample Manager
  16. Video Manager


Those tasks are all needed to complete the database engine alone!

Over the past few months items 1 through 8, 10 and 11 have been completed and currently lie untested (gulp!),
with item 9 (recovery) on the critical path.

Once the recovery implementation is complete I can start testing the paging and recovery logic - this will be extensive and will require, among other things, being able to disable the lazy writer to check the behaviour of the system during recovery scenarios.

After that the allocation logic can be finished off. Currently the device expansion is implemented and the mechanism for allocating a new logical ID is there together with a simple algorithm for picking the best device when expanding a file-group. The only logic missing from the allocation support is the updates needed for the distribution pages (which track database page allocations).

With allocation complete, work on the index managers can be finalised. Unlike SQL Server I will not be supporting clustered indices, although I do share the B-tree implementation - the test harness for proving the algorithm was fun to write, I can tell you...

With completed index managers I can finish the three object managers - easy!