The Computer Science Behind Facebook's 1 Billion Users Source: timothy
"Much has been made about Facebook hitting 1 billion users. But Businessweek has the inside story detailing how the site actually copes with this many people and the software Facebook has invented that pushes the limits of computer science. The story quotes database guru Mike Stonebraker saying, 'I think Facebook has the hardest information technology problem on the planet.' To keep Facebooking moving fast, Mark Zuckerberg apparently instituted a program called Boot Camp in which engineers spend six-weeks learning every bit of Facebook's code."
Actually, Facebook's problem isn't trivial in any sense of the word. The complexity and joins of the various database tables must be insane. With YouTube it's all about raw bandwidth, which is actually a fairly easy problem to solve, especially since 99% of that data is static. You just physically distribute it and throw money and resources at the problem. As far as database structure goes, any CS student should be able to reproduce the bulk of it in a single day: you have videos associated with users, comments associated with videos, and so on. The gist of it is straightforward (see the rough sketch below).
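Just to be concrete about how simple that core schema is, here's a rough sketch. The table and column names are my own invention, not YouTube's actual design; the point is just how few relationships there are and how easily the static parts can be cached and distributed.

```python
# Toy relational sketch of a YouTube-style schema (invented names, not the real thing).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE videos   (id INTEGER PRIMARY KEY,
                       user_id  INTEGER REFERENCES users(id),
                       title TEXT, url TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY,
                       video_id INTEGER REFERENCES videos(id),
                       user_id  INTEGER REFERENCES users(id),
                       body TEXT);
""")

# Almost every common page is one or two joins away, e.g. all comments on a video:
rows = conn.execute("""
    SELECT users.name, comments.body
    FROM comments JOIN users ON users.id = comments.user_id
    WHERE comments.video_id = ?
    ORDER BY comments.id
""", (42,)).fetchall()
```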
Now let's talk about Facebook. There is no compartmentalization of the data. You've heard of "six degrees of separation", whereby any two people on the planet can be socially connected to one another in at most six steps. Well, on Facebook the average degree of separation between any two people is 3.74. That means everyone is very closely networked, the data users really care about is the dynamic, most recent data, and since many people (myself included) open up their information to "friends of friends", there is a tremendous amount of data that any one person can potentially have access to. Even Google search doesn't have this problem, because the bulk of the common search terms can be preprocessed for easy retrieval, and results that are an hour or two old aren't a huge issue.
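To get a feel for why "friends of friends" blows up so fast, here's a toy sketch with an in-memory friend graph. This is obviously nothing like how Facebook stores or traverses its graph; it just shows the breadth-first idea behind degrees of separation and how quickly the two-hop audience grows.

```python
from collections import deque

# Toy friend graph as adjacency sets (made-up users, not real data).
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave":  {"bob", "carol"},
    "erin":  {"carol"},
}

def degrees_of_separation(src, dst):
    """Breadth-first search: how many hops separate two users."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        user, dist = queue.popleft()
        if user == dst:
            return dist
        for friend in friends[user]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None  # not connected

def friends_of_friends(user):
    """Everyone within two hops: the audience a 'friends of friends' post can reach."""
    direct = friends[user]
    return (direct | {fof for f in direct for fof in friends[f]}) - {user}

print(degrees_of_separation("alice", "erin"))  # 2
print(friends_of_friends("bob"))               # {'alice', 'carol', 'dave'}
```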
So you have this massive database (1 billion users, each with many different types of associated data: posts, images, videos, things they've liked, things they've shared, and so on), and each of those 1 billion users has an entirely different set of friends from which recent, essentially real-time data must be pulled, over and over and over again, all day long. Now throw in the privacy rules (which types of posts can be seen by which friends, groups, block lists, etc.) and the problem becomes genuinely hard. Sure, most of us could bang out something with that core functionality without too much difficulty, but making it work in near real-time for 1 billion users at once? That's an incredible undertaking.
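Here's a naive sketch of what a single feed request conceptually has to do, with invented names and a crude stand-in for the privacy rules. I'm not claiming this resembles Facebook's actual feed infrastructure; the point is that even this toy version repeats a fan-out, filter, and merge over hundreds of friends, per user, against data that changed seconds ago.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Post:
    author: str
    timestamp: float
    audience: str          # "public", "friends", or "friends_of_friends"
    text: str

def visible_to(post, viewer, friends, blocked):
    """Crude stand-in for the real privacy rules, which are far richer."""
    if viewer in blocked.get(post.author, set()):
        return False
    if post.audience == "public":
        return True
    if post.audience == "friends":
        return viewer in friends[post.author]
    # "friends_of_friends": viewer is a friend, or shares at least one friend
    return (viewer in friends[post.author]
            or bool(friends[post.author] & friends.get(viewer, set())))

def build_feed(viewer, friends, blocked, recent_posts_by_user, limit=20):
    """Pull each friend's recent posts, drop what the viewer may not see, merge newest-first."""
    candidates = (p for f in friends[viewer] for p in recent_posts_by_user.get(f, []))
    allowed = (p for p in candidates if visible_to(p, viewer, friends, blocked))
    return heapq.nlargest(limit, allowed, key=lambda p: p.timestamp)
```

Doing that naively on every page load is hopeless at this scale, which is exactly why it's an interesting computer science problem rather than a weekend project.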