ThePort is an excellent example of a real-world, in-the-trenches product offering real value to customers. One of the most interesting problems they have to solve is multi-tenancy. How do you provide good performance, complete customization, and support while developing new features and maintaining an individual search index for each customer? It’s not an easy problem to solve.
How did they solve their problems and build a successful system?
* 6 x Dell blade servers running Windows Server 2008 / IIS 7
* 1 r/w SQL Cluster – Dell 6850s (6 single core processors, 32 GB RAM)
* 2 read-only Dell 2950s (2 quad core processors, 16 GB RAM)
* 1 distribution server – Dell 2950 (2 quad core processors, 16 GB RAM)
* 2 Dell blade servers 8 GB RAM each to total 16 GB of available RAM
* Running SharedCache (basically an open source .NET port of MemCacheD. We initially looked at MemCacheD but our internal benchmarking indicated SharedCache had better performance, at least within a Microsoft environment. We may still investigate Microsoft’s Velocity cache platform when it goes live)
* 2 Dell 2950s with 725 GB Storage
* Running Lucene + SOLR
* We chose Lucene over Lucene.NET because Lucene.NET’s wildcard search was a little buggy in our initial beta testing. SQL Full Text wasn’t a viable option because there was no clear and easy way to split indexes between customers. SOLR cores make this part easy. Above and beyond that, Lucene is lightning fast and comes with features we couldn’t turn down (proximity search, searching within documents, and built-in RESTful APIs, to name a few)
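To make the "one SOLR core per customer" idea concrete, here is a minimal sketch of how a per-customer search URL might be built. The host, port, and the `customer_<id>` core naming scheme are assumptions for illustration and are not ThePort's actual configuration; only the `/solr/<core>/select` path layout is standard Solr.

```python
from urllib.parse import urlencode, quote

# Assumed Solr location; not taken from the article.
SOLR_BASE = "http://localhost:8983/solr"


def core_name(customer_id: int) -> str:
    """Hypothetical naming scheme: one Solr core per customer."""
    return f"customer_{customer_id}"


def search_url(customer_id: int, query: str, rows: int = 10) -> str:
    """Build a select URL that only ever touches one customer's core."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{quote(core_name(customer_id))}/select?{params}"


print(search_url(42, "comments:scaling"))
```

Because the core is part of the URL path, a query for one customer cannot accidentally read another customer's index, which is the isolation that SQL Full Text could not easily provide.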
How do you handle multi-tenancy?
A multi-tenant platform has two primary hurdles to overcome:
1. Preventing a single, large customer from overwhelming the system.
The primary bottleneck for this is in the data layer. Our current DB architecture has helped mitigate this problem. The read-only servers help offset most of this by absorbing the bulk of the data calls. We did have to beef up the distribution server because latency between the r/w server and the read only servers had crept too high. Getting a new machine (2 quad cores with 16 GB of RAM) helped reduce the latency to less than a second.
However robust the cluster is, we’ve concluded that we will eventually have to move to a sharded architecture with MySQL. MS SQL licensing fees make both continuing to enhance the cluster and scaling out to multiple machines prohibitive. Additionally, sharding allows us to scale either by customer (because some may be more active than others) or by functional area (photos, comments, etc).
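The two sharding axes mentioned here, by customer and by functional area, could be sketched roughly as follows. Every name in this example (the shard names, the pinned customer, the hash-based placement) is hypothetical; the article does not describe how ThePort planned to route requests.

```python
import hashlib

# Hypothetical shard names, for illustration only.
CUSTOMER_SHARDS = ["shard0", "shard1", "shard2", "shard3"]
FUNCTIONAL_SHARDS = {"photos": "photos_db", "comments": "comments_db"}

# Hypothetical override: an unusually active customer gets its own machine.
PINNED_CUSTOMERS = {1001: "dedicated0"}


def shard_for_customer(customer_id: int) -> str:
    """Route by customer: pinned customers first, otherwise a stable hash."""
    if customer_id in PINNED_CUSTOMERS:
        return PINNED_CUSTOMERS[customer_id]
    digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
    return CUSTOMER_SHARDS[int(digest, 16) % len(CUSTOMER_SHARDS)]


def shard_for_area(area: str) -> str:
    """Route by functional area (photos, comments, ...)."""
    return FUNCTIONAL_SHARDS[area]
```

A stable hash keeps a customer on the same shard across requests, while the pinning table is one simple way to isolate a customer who is more active than the rest.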
2. Allowing clients to have total control over the look, feel, and user experience of their sites.
Allowing CSS control isn’t enough; we needed a templating system that allows total control over the site. We looked at using .NET master pages and user controls to accomplish this, but that assumes a level of .NET knowledge on the part of outside developers. We built a proprietary templating system that unfortunately became too limiting and would eventually have become a drag on performance.
So we settled on using XML / XSLT. All of our business / entity objects are serializable to XML. This made XSLT a natural choice from the templating angle. We’ve seen a considerable boost in performance from this upgrade and an even greater increase in flexibility in terms of what our designers can do. Once the learning curve is overcome, the web designers love the amount of control they get.
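As a rough illustration of why XML-serializable entity objects make XSLT a natural fit, the sketch below serializes a small entity to XML, which is the input an XSLT template would then transform. The `Photo` class and its fields are hypothetical, and only the serialization half is shown; the original system is .NET, and Python's standard library has no XSLT processor.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class Photo:
    """Hypothetical entity object; field names are illustrative."""
    id: int
    title: str
    owner: str


def to_xml(entity) -> str:
    """Serialize a dataclass instance to one XML element per field."""
    root = ET.Element(type(entity).__name__)
    for field in fields(entity):
        child = ET.SubElement(root, field.name)
        child.text = str(getattr(entity, field.name))
    return ET.tostring(root, encoding="unicode")


print(to_xml(Photo(7, "Sunset & sea", "tj")))
```

A designer's stylesheet can then match on elements such as `Photo/title` without knowing anything about the code that produced them.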
What did you do that was especially cool that people could learn from?
XSLT as Custom Templating System
Building a templating system in XSLT that actually allows the template author to make a web service call to our internal web service layer (or external web services) straight from the templating system. This allowed the development team to build a flexible, powerful system that allows a web designer to embed real-time calls into a given template. We accomplish this using XSLT Extension Objects. What we’ve found in our internal testing is that these extension objects scale way better than our previous templating system (a homegrown proprietary system). We’ve used ANTS profiler to compare the two and the difference is in orders of magnitude.
Obviously we have to cache the hell out of this or the performance of the pages the calls are embedded in would suffer. For now, we make the internal web service calls via HTTP, but we will soon be moving this to a TCP call to take advantage of the better connection pooling offered by TCP. We’re most likely to use WCF because of its native support for TCP bindings. However, we haven’t yet benchmarked that, so it’s possible it could change.
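The caching in front of those embedded service calls might look something like this minimal time-to-live cache. The class, its method names, and the 30-second lifetime are assumptions made for the sketch; the article does not say how ThePort keyed or expired these entries.

```python
import time


class TTLCache:
    """Tiny time-to-live cache for the results of template service calls."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value, calling fetch() only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch()
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage: the lambda stands in for a real web service call.
cache = TTLCache(ttl_seconds=30)
body = cache.get_or_fetch("recent-comments:42", lambda: "<comments/>")
```

With this in place, a template that is rendered many times per second triggers at most one service call per key per TTL window.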
Not Using the Database to Build Collections
Another cool thing we’ve done is to move strongly away from using the database to retrieve collections of ‘things’. For instance, if we needed a collection of comments, previously we’d hit the database for the 5, 10, 100, etc comments we wanted, do the sorting / filtering in the DB, return a single dataset, cache that, and then display.
However, this is a database-intensive operation, especially if you’re going to join against user data (which you inevitably will). What we’ve started doing recently is caching the recent comment objects and using our cache provider’s MultiGet ability to retrieve all of the comments in a single call. We then sort / filter in memory in the application tier, discard whatever comments we don’t need, and then display. Doing it this way saves lots of hits to our database and, in fact, gave us a considerable performance gain.
Our tests (on a developer laptop) fetched 10,000 objects from cache in about 1 second, then sorted them by date/time in about 0.015 seconds.
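The pattern described above (one multi-get against the cache, then sorting and filtering in the application tier) can be sketched as follows. The `InMemoryCache` class is a stand-in for a distributed cache client, and `multi_get`, the `comment:<id>` key format, and the `approved` flag are all hypothetical names, not SharedCache's actual API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Comment:
    id: int
    author: str
    posted_at: datetime
    approved: bool = True


class InMemoryCache:
    """Stand-in for a distributed cache; multi_get is a hypothetical name."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def multi_get(self, keys):
        """Return all keys that are present, in one call."""
        return {k: self._store[k] for k in keys if k in self._store}


def recent_comments(cache, comment_ids, limit):
    """Fetch in bulk, then filter and sort in memory instead of in the DB."""
    keys = [f"comment:{cid}" for cid in comment_ids]
    found = cache.multi_get(keys)
    visible = [c for c in found.values() if c.approved]
    visible.sort(key=lambda c: c.posted_at, reverse=True)
    return visible[:limit]
```

Ids missing from the cache are simply skipped here; a real system would need a fallback to the database for those, which this sketch leaves out.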
What prompted you to move to a SOA architecture?
To better compartmentalize our code.
Given the growth of our templating system mentioned above, we realized it was best to truly separate the tiers into discrete areas. Since our application is easily accessed via a set of REST APIs and our own internal skinning system (and who knows what in the future), dividing the application like this gives us a lot of leeway in being able to swap out components. Additionally, we’re doing more and more queuing which lines up nicely with SOA.
Since modern web apps deal with complex data, breaking the work into more discrete operations handled by offline processes on their own infrastructure makes a lot of sense from a performance point of view.
How do you handle consistency between the database and the search engine?
We have a multi-threaded windows service that scans our database once every 5 minutes looking for new data. The service then adds the new items to the Lucene index. We keep audit columns on all our database tables so capturing new data is pretty simple. Once a night, we purge the Lucene index and run a full rescan of the database. We think this system will work for the near to mid term but long term, we’ll take advantage of a queuing system to keep the index in sync.
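A simplified, single-threaded sketch of that sync loop is shown below, using SQLite and a plain dictionary in place of SQL Server and the Lucene index. The `comments` table, the `modified_at` audit column, and the function names are assumptions for illustration.

```python
import sqlite3


def changed_rows(conn, since=None):
    """Rows touched after `since` (per the audit column), oldest first."""
    if since is None:
        sql = "SELECT id, body, modified_at FROM comments ORDER BY modified_at"
        return conn.execute(sql).fetchall()
    sql = ("SELECT id, body, modified_at FROM comments "
           "WHERE modified_at > ? ORDER BY modified_at")
    return conn.execute(sql, (since,)).fetchall()


def sync_once(conn, index, last_sync):
    """Incremental pass: index changed rows, return the new high-water mark."""
    newest = last_sync
    for row_id, body, modified_at in changed_rows(conn, last_sync):
        index[row_id] = body      # stand-in for adding a document to Lucene
        newest = modified_at      # rows arrive in ascending modified_at order
    return newest


def full_rebuild(conn, index):
    """Nightly pass: purge the index and rescan every row."""
    index.clear()
    return sync_once(conn, index, None)
```

An incremental scan like this never notices deleted rows, which is one reason a nightly purge and full rescan is a useful safety net until a queue-based approach replaces it.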
How do you handle release, support, bug fixing, development, etc.?
We have a decent sized dev team: 1 platform architect responsible for overall system architecture (selecting which systems to use, tuning them), 1 lead software architect, and 3 mid-level to senior developers. Since we’re a start-up in a fast-evolving market (social media), we find that we’re constantly having to adjust to market demands and the latest in social functionality. So we have a 2 month build cycle, which is pretty aggressive.
In terms of actual development, we’ve found the following to be keys to success:
1. Daily stand-ups: it’s absolutely necessary for everyone on the team to know what the others are doing. With a code base as large as ours, it’s very likely I’m writing a function someone has already written or solving a problem someone has solved previously. Daily stand-ups help with that.
2. Iterate: Build the core functionality, get it into QA and / or beta, beat the bugs out of it, move to the next piece. We’ve found this to be easier said than done. Market pressures sometimes dictate you roll with something more feature rich than you’d like. Sticking to an iterative cycle creates better code and more market ready products.
3. Beta test: This goes hand-in-hand with #2 above. Get something done and get it in the hands of actual users. This is the best way to find where your app falls down.
With regard to support / bug fixing, we’re moving to a forum-based support model for many of our customers. We’ve found the same problems, especially in an app as configurable as ours, occur over and over. Getting those answers into an open, searchable format should cut down on confusion and get developers talking directly to developers.
Internally we use Trac for bug tracking and devote roughly 20% of our week to maintaining, supporting, and fixing issues. That may seem like a lot but given how configurable our system is, we’re essentially running 50 heavily data driven websites.
WCF sounds like a buggy underperforming mess. How is it working?
So far we have no complaints with WCF. I think baking it directly into .NET 3.5 helped iron a lot of the big kinks out. It does come with its quirks, no doubt. We built our REST libraries on top of it and found that posting XML is not exactly the easiest thing in the world. But that was more than made up for by the ease of deploying all our GET operations with REST. Our next step will be to set up TCP and MSMQ bindings with WCF to handle our internal service requests and queuing, respectively. Since WCF exposes all of these bindings natively, we think we will see a lot of effective code re-use out of this.
I’d like to thank TJ for taking the time and making the effort to write up their architecture for people to learn from. I’m sure it will help others when they are trying to build their own systems.
You too can share the architecture for your amazing system. Come on, you’ve learned a lot from others, it’s time to return the favor and give back. It’s not that hard, really. If interested please contact me and we can get started.