ATTN site devs: thoughts on infrastructure scaling and database synchronization
Started by Chameleon Skin, Nov 13 2009 01:38

First of all, I would like to say thanks to Fernando and the rest of the team that works on the Social website - I know from experience that when you run a customer-facing site, all you ever hear about is what's wrong, never what's right. Despite the glitches, I've been pretty happy with the functionality, and as long as it continues to improve, I won't have any complaints.
I'm sure you have a talented team there that has already given this a lot of thought, but having dealt with my own large-scale content delivery issues (especially the kind that requires a lot of data synchronization), I thought I would share some of the personal experiences that eased the pain for me.
Forum search: I think by far the easiest way to fix this is to just shell out a bit of cash for a Google Search Appliance. They're really affordable, you can index all of your database content with one, and it requires very little time from your devs to set up. You could literally have search functionality purchased, QA'd, and deployed in a week.
Database queueing: From Fernando's post in the main forum (by the way, thanks a TON for that level of transparency), it sounds like one of the big issues was queueing all of the user data and loading it into the master database before it was replicated out to the slaves. It looks like the issue has been resolved, but here is my experience in case you feel there is still work to do. You probably want a multi-master setup across different geo locations. This is where a cloud compute infrastructure really works well. Using something like EC2 or GoGrid will allow you to scale quickly - if you plan ahead, you can literally deploy new compute cores in minutes. Then you can have a user's data sent to the queue that is closest to their geo location, and the master databases synchronize between themselves. I don't know what database you are using on the backend, but even MySQL has pretty good support for that these days.
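To make the multi-master idea concrete, here is a minimal sketch of what a two-master MySQL setup might look like (hostnames, credentials, and the log file/position are placeholders that depend on your own snapshot). Each master gets a unique server-id and staggered auto-increment settings so the two sites never generate colliding primary keys:

    # my.cnf on master A (master B would use server-id = 2, offset = 2)
    [mysqld]
    server-id                = 1
    log-bin                  = mysql-bin
    log-slave-updates                 # needed if slaves hang off each master
    auto_increment_increment = 2      # step by the total number of masters
    auto_increment_offset    = 1      # this master's slot (1-based)

    -- On master A, point replication at master B (and vice versa on B):
    CHANGE MASTER TO
        MASTER_HOST     = 'master-b.example.com',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = '********',
        MASTER_LOG_FILE = 'mysql-bin.000001',   -- taken from master B's snapshot
        MASTER_LOG_POS  = 4;
    START SLAVE;

Note that the auto-increment trick only avoids key collisions; conflicting updates to the same row still have to be resolved at the application level, which is part of why routing each user to a single home region helps.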
Data synchronization: this is typically the downfall of the database queueing above, which is why I'm sure your team decided to go with a single master database that replicated out to slaves. Replicating databases across a local cluster is hard; replicating across multiple locations is even harder. While I haven't tried it yet, I am considering Amazon's new Virtual Private Cloud (VPC) for this purpose; it effectively unites your internal machines and their EC2 compute cores as if they were all part of a single network. They also offer persistent block-level devices (EBS) now, so you don't have to worry about your data vanishing when an instance dies.
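On the persistent-storage point, provisioning one of those block devices from Python with the boto library might look something like this (the size, zone, instance ID, and device name are all placeholders):

    import boto

    # Credentials are read from the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    # environment variables.
    conn = boto.connect_ec2()

    # Create a 100 GB persistent volume in the same availability zone as the
    # target instance, then attach it as a raw block device.
    vol = conn.create_volume(100, 'us-east-1a')
    conn.attach_volume(vol.id, 'i-12345678', '/dev/sdf')

From there you format and mount /dev/sdf like any local disk, and the data survives the instance being terminated.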
Yet another approach is to multicast the data that you have in the upload queue. I have solved some *really* difficult scalability problems that way, using caching software called memcached. The memcached client hashes each key to a server in the cache pool, so it always knows which node holds a given piece of data, even when the nodes are in different physical locations. I used it to hold the queued data and checked the cache before going to the database. That way, even if a queue was lagging behind (say, updating a forum post), I could still pull that data from the cache *before* it got into the database.
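As a rough illustration of that read path, here is a minimal sketch using the python-memcache client; the key scheme and the database/queue helpers are hypothetical stand-ins for whatever you already have:

    import memcache

    # One client, many cache nodes: the client hashes each key to a node,
    # so any web server can find the queued data regardless of location.
    mc = memcache.Client(['cache1.example.com:11211',
                          'cache2.example.com:11211'])

    def save_forum_post(post_id, post, db_queue):
        # Cache the post immediately, then queue the (slower) database write.
        mc.set('post:%d' % post_id, post, time=3600)
        db_queue.put((post_id, post))        # hypothetical async write queue

    def load_forum_post(post_id, db):
        # Check the cache first; fall back to the database only on a miss.
        post = mc.get('post:%d' % post_id)
        if post is not None:
            return post                      # fresh data the DB may not have yet
        return db.load_post(post_id)         # hypothetical DB helper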
Serving content: As for the issues with downloadable content, this is where using a CDN like Akamai works really well. Granted, you still need to handle all of the authentication and DRM on your end, but you can then hand off the download of the actual content to the CDN. They can do it more cost-effectively and quickly than you can with your own servers, and spread the load across the globe for your international customers.
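Every CDN has its own token scheme, but the usual pattern is: authenticate the user yourself, then mint a short-lived signed URL that the CDN's edge servers can verify with a shared secret. A generic sketch (the hostname and secret are placeholders, not Akamai's actual scheme):

    import hashlib
    import hmac
    import time

    CDN_HOST = 'downloads.example-cdn.com'    # placeholder CDN hostname
    SECRET   = b'shared-secret-with-the-cdn'  # placeholder shared secret

    def signed_download_url(path, ttl=300):
        # After auth/DRM checks pass, return a URL valid for ttl seconds.
        expires = int(time.time()) + ttl
        msg = ('%s%d' % (path, expires)).encode('utf-8')
        token = hmac.new(SECRET, msg, hashlib.sha1).hexdigest()
        return 'http://%s%s?expires=%d&token=%s' % (CDN_HOST, path, expires, token)

The edge server recomputes the HMAC from the path and expiry and serves the file only if it matches, so your servers handle the small auth request instead of the bytes of the download.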
Anyway, those are just some thoughts off the top of my head based on my own experience with similar problems. Hopefully you will find some of them useful, or they will trigger ideas of your own.
Thanks again for your work on this - I know how underappreciated developers on large-scale customer websites can be!