Facebook is headquartered in Menlo Park, California at a site that used to belong to Sun Microsystems. A large sign with Facebook’s distinctive “like” symbol—a hand making the thumbs-up gesture—marks the entrance. When I arrived at the campus recently, a small knot of teenagers had congregated, snapping cell phone photos of one another in front of the sign.
Thanks to the film The Social Network, millions of people know the crazy story of Facebook’s rise from dorm room project to second largest website in the world. But few know the equally intriguing story about the engine humming beneath the social network’s hood: the sophisticated technical infrastructure that delivers an interactive Web experience to hundreds of millions of users every day.
I recently had a unique opportunity to visit Facebook headquarters and see that story in action. Facebook gave me an exclusive behind-the-scenes look at the process it uses to deploy new functionality. I watched first-hand as the company’s release engineers rolled out the new “timeline” feature for brand pages.
As I passed through the front entrance of the campus and onto the road that circles the buildings, I saw the name on a street sign: Hacker Way. As founder Mark Zuckerberg explained in an open letter to investors earlier this year, when Facebook filed for its initial public offering, “The Hacker Way” is also the name he has given to the company’s management philosophy and development approach. During my two days at Facebook, I learned about the important role that release engineering has played in making The Hacker Way scale alongside the site’s rapid growth in popularity.
The Menlo Park campus is a massive space, densely packed with buildings; it felt more like I was entering a tiny city than a corporate campus. Inside the buildings, tasteful graffiti-like murals and humorous posters decorate the walls. Instead of offices, Facebook developers work mostly in open spaces laid out like bullpens. Workstations are lined up along shared tables, with no barriers between individual workers.
I eventually reached the area where the release engineering team is headquartered. Like the rest of the development personnel, release engineering uses an open space at shared tables. But their space has a unique characteristic: a well-stocked bar.
The room initially had a partial wall between two vertical support pillars. When the release engineering team moved in, they converted the space into a bar with a countertop called the “hotfix bar,” a reference to critical software patches. They work at a table positioned alongside the bar.
That was where I met Chuck Rossi, the release engineering team’s leader. Rossi, whose workstation is conveniently located within arm’s reach of the hotfix bar’s plentiful supply of booze, is a software industry veteran who previously worked at Google and IBM. I spent a fascinating afternoon with Rossi and his team learning how they roll out Facebook updates—and why it’s important that they do so on a daily basis.
Facebook’s BitTorrent deployment system
The Facebook source code is largely written in the PHP programming language. PHP is conducive to rapid development, but it lacks the performance of lower-level languages and some more modern alternatives. In order to improve the scalability of its PHP-based infrastructure, Facebook developed a special transpiler called HipHop.
HipHop converts PHP into heavily optimized C++ code, which can then be compiled into an efficient native binary. When Facebook unveiled HipHop to the public in 2010 and began distributing it under an open source software license, the company’s engineers reported that it reduced average CPU consumption on Facebook by roughly 50 percent.
Because Facebook’s entire code base is compiled down to a single binary executable, the company’s deployment process is quite different from what you’d normally expect in a PHP environment. Rossi told me that the binary, which represents the entire Facebook application, is approximately 1.5GB in size. When Facebook updates its code and generates a new build, the new binary has to be pushed to all of the company’s servers.
Moving a 1.5GB binary blob to countless servers is a non-trivial technical challenge. After exploring several solutions, Facebook came up with the idea of using BitTorrent, the popular peer-to-peer filesharing protocol. BitTorrent is very good at propagating large files over a large number of different servers.
Rossi explained that Facebook created its own custom BitTorrent tracker, which is designed so that individual servers in Facebook’s infrastructure will try to obtain slices from other servers that are on the same node or rack, thus reducing total latency.
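Facebook hasn’t published its tracker code, but the idea is easy to sketch. The toy Python function below captures the preference for nearby peers; the Peer type and its host and rack fields are my own stand-ins, not Facebook’s actual code.

```python
# A hypothetical sketch of rack-aware peer selection in the spirit of the
# tracker described above; the Peer type and its fields are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    host: str
    rack: str


def prioritized_peers(requester: Peer, peers: list[Peer], limit: int = 50) -> list[Peer]:
    """Return up to `limit` peers, listing peers on the requester's rack first."""
    same_rack = [p for p in peers if p.rack == requester.rack and p.host != requester.host]
    elsewhere = [p for p in peers if p.rack != requester.rack]
    return (same_rack + elsewhere)[:limit]
```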
Rolling out a Facebook update takes an average of 30 minutes—15 minutes to generate the binary executable and another 15 minutes to push the executable to most of Facebook’s servers via BitTorrent.
Facebook typically rolls out a minor update every business day. Major updates are issued once a week, generally on Tuesday afternoons. The release team is responsible for managing the deployment of those updates and ensuring that they are carried out successfully.
Frequent releases are an important part of Facebook’s development philosophy. During the company’s earliest days, the developers used rapid iteration and incremental engineering to continuously improve the website. That technical agility played a critical role in Facebook’s evolution, allowing it to advance quickly.
When Facebook recruited Rossi to head the release engineering team, he was tasked with finding ways to make sure that the company’s rapid development model would scale as the size and complexity of the Facebook website grew. Achieving that goal required some unconventional solutions, such as the BitTorrent deployment system.
During the time that I spent talking with Rossi, I got the impression that his approach to solving Facebook’s deployment problems is a balance of pragmatism and precision. He sets a high standard for quality and robustness, but aims for solutions that are flexible enough to accommodate the unexpected.
In some of our recent articles, we’ve written about the challenges and rewards of moving software applications to faster release cycles. One of the major challenges of operating at this speed is keeping the quality high; it leaves far less time for beta testing.
Quality testing poses a challenge for Facebook, which pushes out new changes every single day. To help spot problems, Facebook employees who access the social network from within the company’s internal network will always see an experimental build of the site based on the very latest code, including proposed changes that haven’t officially been accepted. When employees want to see the current production version of the website from within the network, they use a separate address.
Making the test site the default for employees ensures that pending features get more exposure before they are merged. The test site has some built-in bug reporting tools that make it easy for employees to supply feedback when they encounter issues.
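Facebook didn’t describe how that routing is implemented; purely as an illustration, a decision like it could look something like the following sketch, in which the network range and hostnames are made-up stand-ins.

```python
# Illustrative only: deciding which build to serve based on where a request
# originates and which hostname it used. The network range and hostnames are
# hypothetical, not Facebook's actual configuration.
import ipaddress

CORPORATE_NETWORK = ipaddress.ip_network("10.0.0.0/8")  # hypothetical internal range


def select_build(client_ip: str, host_header: str) -> str:
    """Employees on the internal network see the experimental build by default."""
    if host_header == "production.example.com":  # the separate "show me production" address
        return "production"
    if ipaddress.ip_address(client_ip) in CORPORATE_NETWORK:
        return "experimental"
    return "production"
```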
Facebook also uses automated tests to avoid regressions and identify common issues. The company has two separate sets of these tests; one does some conventional sanity checking on the code and the other simulates user interaction to make sure that the website’s user interface behaves properly.
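The article doesn’t identify the tools behind those test suites. Purely as an illustration of the second kind of test, here is a minimal UI-simulation check written with Selenium; the URL and form-field names are hypothetical.

```python
# Illustration only: a UI-simulation test in the style described above.
from selenium import webdriver
from selenium.webdriver.common.by import By


def test_login_form_renders():
    driver = webdriver.Firefox()
    try:
        driver.get("https://www.example.com/")  # stand-in for the site under test
        # Verify that the login form's fields actually render.
        assert driver.find_element(By.NAME, "email").is_displayed()
        assert driver.find_element(By.NAME, "pass").is_displayed()
    finally:
        driver.quit()
```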
Prior to rolling out a full update, the new code first gets pushed to the “a2” tier, a small number of public Facebook servers. This stage of the testing process exposes the update to many random Facebook users, but still just a fraction of the site’s total audience, and it gives Facebook’s engineers an opportunity to see how the update will perform in production.
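Conceptually, carving out such a tier is just a split of the fleet. The sketch below shows the idea; the fraction and the server list are illustrative, not the real “a2” configuration.

```python
# A minimal sketch of selecting a small public tier that receives a new build
# before the rest of the fleet; the canary fraction is an invented value.
import random


def split_fleet(servers: list[str], canary_fraction: float = 0.01) -> tuple[list[str], list[str]]:
    """Return (canary_tier, remaining_servers)."""
    shuffled = random.sample(servers, len(servers))
    cutoff = max(1, int(len(shuffled) * canary_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]
```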
Facebook hosts its own Internet relay chat (IRC) server for internal collaboration. Many of the company’s engineers idle on a main channel while they are working. According to Rossi, it typically has 700 people in it during the average work day. Facebook’s tool developers have created IRC bots that provide various kinds of functionality to integrate IRC into Facebook’s development and deployment workflow.
When Rossi is about to roll out an update, he initiates a checkin procedure on IRC. All of the developers who have submitted code for inclusion in the pending update are notified in the channel and have to respond to verify that they are present and ready for the update to go out.
When a developer doesn’t respond within a few minutes, Rossi can send a command to a bot that will attempt to get the developer’s attention through several different communication channels, including e-mail and text messages. As Rossi explained to me, he typically prefers to have all of the contributing developers on hand when deploying an update.
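The escalation logic a bot might use is simple to sketch. In the toy version below, every channel function is a stand-in for a real notification mechanism, and the timing values are invented.

```python
# Hypothetical sketch of escalating to a developer who hasn't acknowledged
# the pre-release checkin; the channel functions are placeholders.
import time


def ping_irc(dev: str) -> None:
    print(f"IRC ping sent to {dev}")


def send_email(dev: str) -> None:
    print(f"E-mail sent to {dev}")


def send_text(dev: str) -> None:
    print(f"Text message sent to {dev}")


def escalate(dev: str, has_acknowledged, wait_seconds: int = 180) -> bool:
    """Work through increasingly intrusive channels until the developer responds."""
    for channel in (ping_irc, send_email, send_text):
        channel(dev)
        deadline = time.monotonic() + wait_seconds
        while time.monotonic() < deadline:
            if has_acknowledged(dev):
                return True
            time.sleep(5)
    return False
```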
An important aspect of Facebook’s development culture is the idea that developers are fully responsible for how their code behaves in production. This philosophy mirrors the “DevOps” movement, which encourages lowering the wall between software development and IT operations.
If any of the code in a Facebook update causes problems in production, the developer who wrote it is on the hook for making sure that the issue gets resolved as quickly as possible.
Rossi’s workstation at Facebook consists of a 30-inch Dell display, a Mac laptop, and a secondary vertical monitor. During the Tuesday I spent with him, much of his work was done in a browser and in terminal windows. When he was ready to roll out the update, he issued a command in one of the terminals to begin the process.
We watched the status of the rollout in one of Facebook’s Web-based system monitoring tools. The webpage displayed a large progress bar showing the percentage of the company’s servers that had successfully been updated to the new binary. The progress bar moved forward as the automatic rollout proceeded. At the far left edge, a thin sliver of red appeared, representing the small number of systems that failed to pick up the new version.
Rossi said that he commonly sees a small number of systems fail to complete the update during deployment and that it’s usually caused by hardware issues. For example, a server might fail to update if its storage capacity is low or if it encounters a network issue while torrenting the file. The number of servers that fail is typically small enough to pose no difficulties.
While the software deployed to the servers, Rossi described how some characteristics of Facebook’s architecture impact the update process. Facebook is designed to be stateless and distributed, in the sense that the user’s session isn’t tied to any particular server. Any given page request can be handled by any of the servers in Facebook’s infrastructure.
That approach offers a lot of resilience. When Facebook performs an update, it doesn’t have to worry about serializing and migrating the state of user sessions. The deployment system restarts the Facebook executable process on the servers in waves as they receive the update. The servers that are already finished or still running the old version can continue handling incoming page requests while the update is being rolled out across the company’s infrastructure.
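Because no session state is pinned to a server, the wave-based restart reduces to a simple loop. The sketch below shows the shape of it; restart_server() is a hypothetical helper, not Facebook’s deployment code.

```python
# A minimal sketch of restarting the fleet in waves so that servers outside
# the current wave keep answering requests.
from typing import Callable


def rolling_restart(servers: list[str], wave_size: int,
                    restart_server: Callable[[str], None]) -> None:
    """Restart servers one wave at a time."""
    for start in range(0, len(servers), wave_size):
        wave = servers[start:start + wave_size]
        for server in wave:
            restart_server(server)  # swap in the new binary and restart the process
```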
Facebook continues operating at nearly full capacity during an update. A typical Facebook deployment doesn’t require scheduled downtime or cause any other disruption to the website. Rossi said that the no-downtime update is an important requirement of the Facebook release strategy. He also views it as a hallmark of quality Web software engineering.
After the update completes, Rossi looks at various aspects of the system to make sure that the changes didn’t break anything. His team has access to a sophisticated set of analytics tools that they use to track Facebook’s status. The main dashboard shows a multitude of line graphs that show changes in traffic, resource consumption, error rates in individual segments of the product, and many other relevant metrics.
Watching the fluctuation of those vital signs helps Facebook identify problems in the system. Comparing against historical data makes it easier to pinpoint exactly when a problem began to occur. The release team and other Facebook engineers pay particularly close attention to the site’s status after an update to make sure that there are no anomalies.
If a problem is detected, such as an unexpectedly high error rate from some part of the system, the company’s engineers can dig into the error logs to see exactly what’s going on. Facebook’s internal tools for viewing and analyzing error logs make it easy for the user to see what code changes are associated with a particular error message.
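Rossi didn’t walk me through the math behind those alerts, but a basic version of the idea—comparing a component’s current error rate against its recent historical baseline—might look like this sketch, where the threshold is an assumption of mine.

```python
# Illustrative sketch of flagging a post-deployment anomaly by comparing the
# current error rate to its historical baseline; not Facebook's actual tooling.
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Flag `current` if it sits well above the historical mean."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and (current - mu) / sigma > sigmas
```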
The many data sources tracked by Facebook’s internal monitoring tools even include tweets about Facebook. That information is displayed in a graph with separate trend lines to show the change in volume of positive and negative remarks. This is useful, since one of the things that people do when they encounter a technical problem on a social network is complain about it on a different social network.
The update that I observed went smoothly; no technical problems or bugs emerged after the rollout. The graphs showed a minor spike in log messages from one system component, but it ended up being a non-issue after Rossi’s team tracked down the source.
Reverting is for losers
Although there were no fires to put out while I was there, Rossi indulged my curiosity by describing how Facebook responds when an update doesn’t go smoothly. If a serious bug gets detected after an update, the release engineering team works with the relevant developers to resolve the problem as quickly as possible. When a fix is ready, Rossi’s team will spin up and roll out a new update.
I asked him if he ever has to revert to a previous version of the site when there are bugs that can’t easily be fixed. “Reverting is for losers!” he replied.
He went on to explain that he does, in fact, have a mechanism in place for reverting to a previous version, but it’s only used as a measure of last resort. The servers retain previous versions of the Facebook binary and can be made to switch back to those if it’s absolutely necessary.
He said that rolling back to a previous version of Facebook is a bit like yanking the emergency stop handle in a train; it’s undesirable and seldom done. In the years that he has been at Facebook, he’s only had to do it a few times.
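Rossi didn’t show me the rollback mechanism itself, but since the servers retain older binaries, the last-resort switch could plausibly look like re-pointing the active build at a retained one. The directory layout and helper below are assumptions for the sake of illustration.

```python
# Hypothetical sketch of a last-resort rollback: each server keeps older
# builds on disk, and rolling back re-points a "current" symlink at one of
# them before the process is restarted.
import os


def rollback(releases_dir: str, previous_version: str) -> None:
    """Point the 'current' symlink back at a retained earlier build."""
    target = os.path.join(releases_dir, previous_version)
    link = os.path.join(releases_dir, "current")
    temp_link = link + ".tmp"
    if os.path.lexists(temp_link):
        os.remove(temp_link)
    os.symlink(target, temp_link)
    os.replace(temp_link, link)  # atomic swap; the service is then restarted
```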
Facebook’s testing practices and culture of developer accountability help to prevent serious bugs from being rolled out in production code. When a developer’s code disrupts the website and necessitates a post-deployment fix, the incident is tracked and factored into Facebook’s assessment of the developer’s job performance.
The company’s internal tools have a Facebook-inspired mechanism that Rossi uses to keep score. Facebook’s developers all have a “karma” rating that is tracked through the code review system. Rossi can increase or decrease a developer’s karma by clicking on thumbs-up and thumbs-down icons that appear next to the developer’s name in a Web-based dashboard.
The thumbs-up icon in Rossi’s tool is the same one used for the “like” function on the social networking site. The thumbs-down image is the same icon, but upside down. When Rossi showed me the icons, he joked that he’s the only person in the world who has a Facebook “dislike” button.
The karma scores help Facebook identify employees who are struggling, but the scores are also useful during the code review process. When Rossi sees a merge proposal from an engineer with a low karma score, he will know at a glance that accepting code from that developer potentially poses a higher risk.
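Stripped of the Facebook-style icons, the karma mechanism is just a running score per developer. A toy version might look like the following; the starting score and risk threshold are invented for the example.

```python
# Illustrative only: a toy per-developer "karma" score adjusted by
# thumbs-up and thumbs-down actions.
from collections import defaultdict


class KarmaBoard:
    def __init__(self, starting_score: int = 4) -> None:
        self.scores = defaultdict(lambda: starting_score)

    def thumbs_up(self, developer: str) -> None:
        self.scores[developer] += 1

    def thumbs_down(self, developer: str) -> None:
        self.scores[developer] -= 1

    def is_higher_risk(self, developer: str, threshold: int = 2) -> bool:
        """Flag developers whose score has dropped to or below the threshold."""
        return self.scores[developer] <= threshold
```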
Employees with low karma can regain their lost points over time by performing well—though some also try to help their odds by bringing Rossi goodies. Booze and cupcakes are Rossi’s preferred currency of redemption; the release engineering team has an impressive supply of booze on hand, some of which was supplied by developers looking to restore their tarnished karma.
I spoke with Rossi about his vision for how Facebook’s deployment strategy will change as the company’s technical infrastructure evolves. He said that future developments will enable his team to dramatically accelerate the rollout procedure, reducing the total build and deploy time to a fraction of the current 30 minutes.
One of the major ongoing development efforts at Facebook is a project to replace the HipHop transpiler. Facebook’s developers are creating their own bytecode format and custom runtime environment, called the HipHop Virtual Machine, to power the next generation of the Facebook platform. Once the project is finished, the company will be able to compile its PHP source into bytecode that is executed by the virtual machine.
Transitioning to a managed code model, similar to that of Java and .NET, will give Facebook more flexibility across the board. In addition to offering many other advantages, Rossi explained that it will have significant implications for the deployment process. Instead of having to push a 1.5GB binary to all of the servers, the company can push thin bytecode deltas representing just the parts that have changed. Facebook may even be able to splice the updated bytecode into the application while it’s running, avoiding a process restart.
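Facebook hasn’t described its delta format, but the core idea—ship only what changed between two builds—can be sketched by comparing per-file hashes. The build layout and paths below are hypothetical.

```python
# A minimal sketch of computing which files changed between two builds so
# that only the differences need to be pushed.
import hashlib
from pathlib import Path


def file_hashes(build_dir: str) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    root = Path(build_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def changed_files(old_build: str, new_build: str) -> list[str]:
    """Files that are new or different in the new build."""
    old, new = file_hashes(old_build), file_hashes(new_build)
    return [name for name, digest in new.items() if old.get(name) != digest]
```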
When an update can be pushed in mere minutes without a heavyweight rollout process, Facebook will be able to abandon its fixed update schedule and move to a model where changes are deployed incrementally, as they are developed. That approach would allow the company’s developers to be even more agile than they are now.
After the Tuesday update process finished and Rossi’s team analyzed the system to make sure that the update hadn’t introduced any problems, they celebrated by downing a few drinks at the hotfix bar.
As I left the Facebook campus at the end of the day and strolled past the Hacker Way sign again, I reflected on the significant role that something as invisible as “release engineering” plays in bringing Facebook to the masses.
Facebook’s transition to the timeline profile layout will increase the social network’s emphasis on providing a platform for users to share experiences and to document their personal narratives. The technical infrastructure that powers those capabilities has a story of its own, and an identity tied to Facebook’s unique developer culture.