Facebook is the world's largest social networking site, with over a billion users logging in at least once a month and uploading more than 2.5 billion pieces of content daily. How do Facebook engineers support a platform of this size while continuously rolling out new features? Kent Beck, the creator of Extreme Programming and now a Facebook employee, co-authored a recent paper that takes an in-depth look at Facebook's development and deployment processes.
Facebook engineers clearly do not follow the traditional waterfall model used in the software industry. Instead, they continuously develop new features and deploy them quickly so that users can access them right away, a practice commonly called continuous deployment. In their view, Facebook's development never truly ends, and its codebase keeps growing. It currently exceeds 10 million lines of code, 8.5 million of them PHP, and its growth over time has been superlinear.
At Facebook, all frontend engineers work on the same stable branch, which accelerates development by eliminating the cumbersome process of merging branches. During everyday development, everyone uses Git locally, and when the code is ready, it gets pushed to SVN (for historical reasons). This naturally separates the code under development from the code ready for deployment.
However, to ensure the stability of the site, just pushing code to SVN doesn’t mean it will automatically go live. Facebook employs a balanced approach by combining daily and weekly releases. By default, all code changes are released weekly, with each release including a relatively large number of updates. On Sunday afternoons, the release engineer pushes the code to SVN, followed by extensive automated testing, including many correctness and performance regression tests. This version then becomes the default version used internally by Facebook employees, with the official release typically scheduled for Tuesday afternoons.
Release engineers score each engineer's track record, known internally as "Push Karma." Engineers whose code frequently causes issues receive lower scores, so their code gets extra scrutiny. The purpose is to control release risk rather than to judge individuals, so the scores remain confidential. Larger changes, and changes that drew extensive discussion during code review, are also considered higher risk and receive extra attention.
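The risk signals described above (a low score, a large change, a heavily discussed review) can be combined into a simple check. The following is only a minimal sketch: the `Change` fields, the `needs_extra_scrutiny` function, and every threshold are hypothetical, since the article does not describe how Push Karma is actually computed.

```python
from dataclasses import dataclass


@dataclass
class Change:
    author: str
    lines_changed: int
    review_comments: int


def needs_extra_scrutiny(change, karma, karma_floor=3, max_lines=500, max_comments=20):
    """Return True if a change should get a closer look before release.

    All thresholds are made-up illustrative values, not Facebook's.
    """
    # Authors without a recorded score are treated as neutral (at the floor).
    low_karma = karma.get(change.author, karma_floor) < karma_floor
    large = change.lines_changed > max_lines
    contentious = change.review_comments > max_comments
    return low_karma or large or contentious


karma = {"alice": 5, "bob": 1}
print(needs_extra_scrutiny(Change("alice", 40, 2), karma))   # False
print(needs_extra_scrutiny(Change("bob", 40, 2), karma))     # True
print(needs_extra_scrutiny(Change("alice", 900, 2), karma))  # True
```

Any single signal is enough to flag a change here. A real system might weight the signals instead, but the article gives no detail on that.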
Besides the weekly releases, there are two smaller releases every day on other weekdays, mostly involving non-critical updates or bug fixes. In extreme cases, more releases may occur, even on weekends.
Before being included in a release, the code has already undergone unit testing by the developer and a Code Review. At Facebook, Code Review is a crucial step, facilitated by a tool called Phabricator, which integrates with the version control system.
In addition to extensive automated testing, every employee who uses Facebook internally acts as a tester, which amounts to heavy real-world use of each new version. Any employee can report the issues they encounter, so as more developers join, the amount of testing the code receives grows with them.
In terms of performance, Facebook uses Perflab to compare the performance of old and new code. If the new code performs poorly and the developer cannot fix it promptly, the relevant code is excluded from the current release until the issue is resolved. Even small performance issues are not overlooked because they can quickly accumulate into significant capacity and performance problems. Perflab visually presents system performance through charts.
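The gating rule described above, holding a change back when it performs worse than the current code, can be sketched as follows. This is an illustrative assumption and not Perflab's actual method: the function names, the use of mean latency, and the 1% tolerance are all hypothetical.

```python
from statistics import mean


def regressed(baseline_ms, candidate_ms, tolerance=0.01):
    """True if the candidate's mean latency exceeds the baseline's by more than tolerance."""
    return mean(candidate_ms) > mean(baseline_ms) * (1 + tolerance)


def filter_release(changes, baseline_ms):
    """Split changes into (included, held_back) using the regression check."""
    included, held_back = [], []
    for name, samples in changes.items():
        if regressed(baseline_ms, samples):
            held_back.append(name)   # excluded until the author fixes it
        else:
            included.append(name)
    return included, held_back


baseline = [100.0, 102.0, 98.0]  # hypothetical latency samples for the old code
changes = {
    "faster_feed": [95.0, 97.0, 96.0],
    "slow_widget": [104.0, 106.0, 105.0],
}
included, held_back = filter_release(changes, baseline)
print(included)   # ['faster_feed']
print(held_back)  # ['slow_widget']
```

The tolerance is deliberately small, reflecting the article's point that even minor regressions add up quickly.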
For a website like Facebook, weekly releases are staged. First comes H1, where the code is deployed to servers accessible only internally for final testing, often referred to as "pre-release" by other companies. Next is H2, deploying to thousands of servers and opening access to a small group of users. If no issues arise in H2, the process moves to H3, deploying across all servers.
If issues are found during this process, engineers immediately fix them and restart the staged deployment. Alternatively, they can roll back the code, either by reverting specific changes and their dependent files or by rolling back the entire binary package.
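The staged process above, in which a build moves through H1, H2, and H3 and restarts from the beginning after a fix, can be sketched as a small loop. The `staged_release` function and its `is_healthy` and `fix` callbacks are hypothetical names used for illustration only. Real deployments involve far more than this.

```python
STAGES = ["H1", "H2", "H3"]  # internal servers, small user group, all servers


def staged_release(build, is_healthy, fix=None, max_attempts=3):
    """Walk a build through H1 -> H2 -> H3, restarting from H1 after a fix.

    Returns (released_build_or_None, log), where log lists every step taken.
    """
    log = []
    for _ in range(max_attempts):
        for stage in STAGES:
            log.append((build, stage))
            if not is_healthy(build, stage):
                break  # a problem was found at this stage
        else:
            return build, log  # every stage passed
        if fix is None:
            break  # no fix available, so fall through to rollback
        build = fix(build)  # fixed build restarts the staged deployment
    log.append((build, "rollback"))
    return None, log


# Hypothetical scenario: build v1 fails in H2, and the fixed build v2 passes.
healthy = lambda build, stage: not (build == "v1" and stage == "H2")
released, log = staged_release("v1", healthy, fix=lambda b: "v2")
print(released)  # v2
print(log)
```

Calling the same function without a `fix` models the alternative path, where the release is rolled back instead of repaired.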
Facebook has a vast number of servers distributed across four geographical locations. The entire release package is approximately 1.5GB and takes around 20 minutes to distribute fully. To achieve this, the package is distributed over BitTorrent, with rack and cluster affinity taken into account. Since Twitter open-sourced Murder, its BitTorrent-based release solution, using BitTorrent for releases has become common practice in the industry.
During a release, the developers whose changes are included must be online. Release engineers confirm this via an IRC bot, and if someone is unavailable, their changes are rolled back. This ensures that issues are detected and fixed early in the release cycle. Given the complexity of such a large system, some issues may still be hard to detect promptly, so Facebook continuously monitors system health using internal tools like Claspin and external sources (such as Twitter).
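The availability rule above amounts to partitioning the pending changes by whether their author has checked in. Here is a minimal sketch of that step. The function name and the data shapes are assumptions, and the IRC bot that gathers the list of online authors is not modeled.

```python
def partition_by_availability(changes, online_authors):
    """Split change IDs into (kept, reverted) based on author availability.

    changes: dict mapping a change ID to its author's handle.
    online_authors: set of handles that responded to the check-in.
    """
    kept = [cid for cid, author in changes.items() if author in online_authors]
    reverted = [cid for cid, author in changes.items() if author not in online_authors]
    return kept, reverted


pending = {"D101": "alice", "D102": "bob", "D103": "alice"}  # hypothetical IDs
kept, reverted = partition_by_availability(pending, online_authors={"alice"})
print(kept)      # ['D101', 'D103']
print(reverted)  # ['D102']
```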
Through the Gatekeeper system, engineers can easily control how many users have access to specific new features, filtering based on location or age. In case of issues, they can quickly disable access to certain features. With Gatekeeper, engineers can conveniently conduct A/B testing, collecting real user feedback to adjust the product accordingly. It's worth noting that at Facebook, engineers choose what they work on, preferring to build and test features directly with users rather than guessing user needs in meetings.
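A feature gate of the kind described above can be sketched in a few lines. This is not Gatekeeper's real interface, which the article does not show. The `Gate` class, its fields, and the hash-based bucketing are assumptions chosen to illustrate percentage rollout, location and age filters, and a quick off switch.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Gate:
    name: str
    percent: int = 0                             # share of eligible users, 0-100
    countries: set = field(default_factory=set)  # empty set means any country
    min_age: int = 0
    enabled: bool = True                         # set to False to switch the feature off

    def allows(self, user_id, country, age):
        """Decide whether this user sees the feature."""
        if not self.enabled:
            return False
        if self.countries and country not in self.countries:
            return False
        if age < self.min_age:
            return False
        # Hash the gate name with the user ID so each user lands in a stable
        # bucket from 0 to 99, independent of other gates.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < self.percent


gate = Gate("new_timeline", percent=10, countries={"US", "CA"}, min_age=18)
exposed = sum(gate.allows(uid, "US", 30) for uid in range(10_000))
print(exposed)  # roughly 10% of the 10,000 eligible users

gate.enabled = False  # disable the feature for everyone at once
print(gate.allows(1, "US", 30))  # False
```

Because the bucket is derived from a hash instead of a random draw, a given user gets the same answer on every request. That stability is what makes this kind of gate usable for A/B comparisons.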
Kent Beck states in the article:
"Methodologies and tools alone are not enough, as they are always prone to misuse. Therefore, fostering a corporate culture that encourages personal responsibility is critical."
Currently, Facebook has about 1,000 development engineers but only three release engineers, and no dedicated test engineers. Every engineer can view all the code, submit patches, or file detailed problem descriptions. Engineers are responsible for writing comprehensive unit tests, for making sure their code passes all regression tests, and for supporting it through subsequent operations and maintenance.
Besides being accountable for their own code, engineers face a range of significant challenges that often require extensive experimentation with multiple solutions. For example, to address PHP performance issues, three different solutions were developed simultaneously. When the lead of one solution realized another was better, they stopped their work. Ultimately, HipHop won, but the efforts of the other teams weren't wasted: they provided important backup capabilities.
The article concludes by mentioning Facebook’s Bootcamp training program for new hires. For more details on this, early Facebook employee Hui Wang describes it thoroughly in his article "Taming Your New Engineers - Discussing Bootcamp."