“Thought Experiment” – High Performance Distributed Cache
December 18, 2009I have been thinking a lot about the current structure of data and storage locations across an enterprise and have been feeling like something was wrong with the picture I had formed.
First let me explain that I have come to love simplicity. There is nothing more beautiful than writing 10 lines of code that does something so simple and so native that you have to look at it twice just because it looks too simple to be true.
There is, however, an area that fells like a small battle is being waged – distributed cache. This problem arises after 90% of your read operations are humming along and you want your cache latency to be very small – say within 5 seconds. You have an issue because your legacy systems run on mainframes, you have older VB applications writing and reading data directly from a database and mainframe. There is no governance around information and any application it seems can go in an update sources of data at any time.
The above constraints mean that you have no means of receiving notifications of events or data actions which mean that you can’t wire into a BUS/MSMQ drop to get notified of informational changes. You also can not update these applications because nobody really knows how they work and the risk is seen to be too high by management (and they are right).
The “Perfect” solution
In a perfect would we would be running cache as a first level tier. Your SOA (REST) interfaces would be communicating with some enterprise bus (BizTalk, ApacheMQ, etc) and that bus would exist to broker to any legacy tier and database tier that existed.
Lets trace a call in this architecture…
- A request is made from the UI/APP to update a contact’s address
- The request is forwarded to a REST interface which delegates the call to a ESB
- The ESB picks up the request and forwards it to the primary implementation (see parallel branch)
- The ESB returns back to the REST/UI layer success
[Parallel Branch]
- Thread 1: The ESB identifies a simple object update and saves the data to distributed cache
- Thread 2: The ESB in a separate instance/thread drops a message into a transactional MSMQ with reliable messaging and ordered delivery
The reason this is so perfect is there is barely an IO hit that was incurred because of this read/write operation. All read/write operations were primarily done in memory (at the moment fastest). We get fail over in case our cache goes down because of the MSMQ drop.
The primary hit from web/app layer would hit the service broker in thread 2 (transaction MSMQ hits disk on target machine) but this would be executing well after the result was actually received. Updating data would primarily be as fast as network/CPU/memory allowed and that data would immediately be updated in cache and then a MSMQ message dropped into some further back end for a physical read/write to disk(s).
This implementation is also scalable as distributed cache scales extremely well and the ESB would likely be network/CPU bound which could be solved by horizontal scaling (to no known limit).
In a worst case scenario where your data center goes down all that would happen is your ESB would continue to write ordered messages to MSMQ and the cache would still be running. You’ll just have to hope when you bring the system back up that the back end can keep up
. The business side wouldn’t even notice (provided 99.9% of data is in cache and written/read from as primary). It’s an amazing architecture – one that is used by facebook, youtube, amazon, etc. Very high performance and very scalable. However, I don’t exist in a perfect world.
The “Not-So-Perfect” Solution
With the constraints in place our primary issue is that of notification of data updates. The primary store is still treated as the database by 99% of the applications housed in your enterprise and updates are made in no structured manner. The closest it comes to structured is the database tables themselves. And to be honest, that is good enough for us.
We can keep the ESB like we have before at the cost of complexity (see 99% of other applications wont use them but they each have a lifespan of about 5 years). The core issue is that of cache integrity. If data has changed since last cache than augmenting data at the cache level would result in potential issues when writing back to the store.
This can be avoided by introduction of a “Cache Invalidation Service”. The primary goal of this service is to receive REST/SOAP based messages that are subscribed to from an ESB. Since 99% of applications write/read from SQL instances we can work with SQL Server 2005/2008 query notifications to flush our cache of specific items when they are updated.
At the end of the day
So this will work but there is something bothering me… This all seems far too much like a well defined pattern and this should already be solved by a vendor. The hardest part is identifying the objects throughout the enterprise and then exposing them via REST while leveraging distributed caching. The actual infrastructure after that is easy.
I can imagine in time somebody might come up with a tool/app/service that allows mapping between physical layers to objects and then it would do all the heavy lifting. However, caching is still something that is hand-crafted, which I don’t like handcrafted solutions.
Useful Links for Query Notification
Understanding When Query Notifications Occur
Using SqlNotificationRequest to Subscribe to Query Notifications
