Category Archives: Cloud-Common

[repost ]5 Ways To Make Cloud Failure Not An Option


With cloud SLAs generally being worth what you don't pay for them, what can you do to protect yourself? Sean Hull, in "AirBNB didn't have to fail", has some solid advice on how to deal with outages:

  1. Use Redundancy. Make the database and web server tiers redundant using multi-AZ deployments or, alternatively, read replicas.
  2. Have a browsing-only mode. Give users a read-only version of your site. Users may not even notice a failure, since they will only see problems when they need to perform a write operation.
  3. Web Applications need Feature Flags. Build in the ability to turn major parts of your site off and on, and flip the switch when problems arise.
  4. Consider Netflix's Simian Army. By randomly causing outages in your application you continually test your failover and redundancy infrastructure.
  5. Use multiple clouds. Use Redundant Arrays of Inexpensive Clouds as a way of surviving outages in any one particular cloud.
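Items 2 and 3 above boil down to the same mechanism: a switch that is consulted before any risky operation. Here is a minimal sketch in Python; the flag names and the `browsing_only` attribute are invented for illustration, not taken from any real flag library:

```python
# Minimal feature-flag registry: flip off major site features, or force a
# site-wide browsing-only (read-only) mode, when an outage hits.
class FeatureFlags:
    def __init__(self):
        self.flags = {"recommendations": True, "checkout": True}
        self.browsing_only = False  # read-only mode: writes are refused

    def enabled(self, name):
        # a feature is off if globally disabled or if the site is read-only
        return not self.browsing_only and self.flags.get(name, False)

    def write_allowed(self):
        return not self.browsing_only

flags = FeatureFlags()
flags.browsing_only = True          # database tier is down: flip the switch
assert not flags.write_allowed()    # write operations are refused...
assert not flags.enabled("checkout")
# ...while read paths keep serving cached/replicated content to users
```

In a real system the flag store would live in shared configuration (a database or a config service) so that flipping a flag takes effect across all web servers at once.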

None of these are easy and it’s worth considering that your application may not need them at all. Life will almost always go on anyway.

Sean has many more details in "AirBNB didn't have to fail".

[repost ]Are We Seeing The Renaissance Of Enterprises In The Cloud?


A series of recent surveys on the subject seems to indicate that this is indeed the case:

Research conducted by HP found that the majority of businesses in the EMEA region are planning to move their mission-critical apps to the cloud. Of the 940 respondents, 80 percent revealed plans to move mission-critical apps at some point over the next two to five years.

A more recent survey, by research firm MeriTalk and sponsored by VMware and EMC (NYSE:EMC), showed that one-third of respondents say they plan to move some mission-critical applications to the cloud in the next year. Within two years, the IT managers said they will move 26 percent of their mission-critical apps to the cloud, and in five years, they expect 44 percent of their mission-critical apps to run in the cloud.

The Challenge – How to Bring Hundreds of Enterprise Apps to the Cloud

The reality is that cloud economics only start making sense when there are true workloads that utilize the cloud infrastructure.

If the large majority of your apps fall outside of this category, then you’re not going to benefit much from the cloud. In fact, you’re probably going to lose money, rather than save money.

The Current Approach

  • Focus on building IaaS – The cloud strategies of many enterprises have centered on making the infrastructure cloud-ready. This basically means ensuring that they can spawn machines more easily than before. A quick look at many initiatives of this nature shows that only a small portion of enterprise applications actually run on such new systems.
  • Build a new PaaS – PaaS has been touted as the answer for running apps on the cloud. The reality, however, is that most existing PaaS solutions only cater to new apps, and quite often to the small, non-mission-critical share of our enterprise applications, which still leaves the majority of our enterprise workload outside of our cloud infrastructure.
  • App Migration as a One-Off Project – The other approach for migrating applications to the cloud has been to select a small group of applications and migrate them one by one, quite often on the assumption that application migration is a one-off project. The reality is that applications are more like living organisms: things fail, are moved, or need to be added and removed over time. It is therefore not enough to move apps to the cloud using some virtualization technique; the way they are run and maintained must also fit the dynamic nature of the cloud.

Why is This not Going to Work?

Simple math shows that if you apply this model to the rest of your apps, it's probably going to take years of effort to migrate them all to the cloud. The cost of doing so is going to be extremely high, not to mention the time-to-market issue, which can be an even greater risk in the end: if migration takes too long, it will show up in the cost of operation, in profit margins, and even in the ability to survive in an extremely competitive market.

What’s missing?

What we're missing is a simple and systematic way to bring all these hundreds and thousands of apps to the cloud.

Moving Enterprise Workloads to the Cloud at a Massive Scale

Instead of thinking of cloud migration as a one-off thing, we need to think of cloud migration on a massive scale.

Thinking in such terms drives a fairly different approach.

In this post, I outlined what I believe should be the main principles for moving enterprise applications at such a scale.

Read full post:


[repost ]A Review of the Cloud Computing IaaS Industry, as Represented by Aliyun and Shanda Cloud






1 IaaS in Retrospect

1.1 How Long Has Cloud Computing Been in China?

It is now late August or early September of 2012. How long has IaaS existed in China? Two years? Three? No: roughly four and a half years. There are always companies claiming, publicly or privately, to have been doing cloud computing for five or six years, and headhunters keep telling me they are looking for senior cloud computing veterans. Fine; if they mean SaaS, I'll grant that such people exist. But they all mean cloud computing, and IaaS in particular, and to that I say: no such people exist. To claim that people were doing it before the terms "cloud computing" and "IaaS" had even been coined or had reached China is stretching things much too far. In fact, the term "Cloud Computing" did not exist in English before 2006. Around 2006 it began to appear occasionally. By late 2007 its frequency rose rapidly. In early 2008 it began to be translated into Chinese as "云计算". In the first half of 2008, no more than ten people in China genuinely understood what the term meant.

1.2 Who Started Doing Cloud Computing and IaaS First?


1.2.1 Who Else Joined In?









2 The Current State of IaaS


2.1 The First Tier, Represented by Aliyun, Shanda Cloud, and HiChina


2.1.1 Aliyun















2.1.2 Shanda Cloud










2.1.3 HiChina









2.2 The Second Tier, Represented by LinkCloud, West.cn, and HuaYun

First, it must be said that being in the second tier does not mean these companies lack achievements or have inferior products. The main disadvantages of the second tier are: (1) a weaker inherited foundation; and (2) fewer customers acquired so far than the first tier. The first factor is the decisive one. Analyzing the strength, investment, inheritable customer base, technology and influence behind these IaaS vendors, along with product price/performance and customer counts, I find that the second tier's return on investment is currently actually higher than the first tier's. What really places them in the second tier is the weak inherited foundation: without huge financial resources, market visibility, and an existing customer base to build on, even effective investment with relatively good returns leaves their overall strength and influence second-rate in the market. The first- and second-tier rankings closely track the organic Baidu search rankings for "cloud host" (云主机).


2.2.1 LinkCloud




2.2.2 West.cn



2.2.3 HuaYun



2.3 The Third Tier, Represented by ViaCloud and Pacific Telecom



Dozens of other IDC service providers, such as Huyi China and Yunpai, also offer cloud server services, and RuiHao Open Source has been running Xen VPS for a long time.


2.4 The Fourth Tier, Represented by Huawei


2.4.1 Huawei


2.4.2 China Telecom


2.4.3 Baidu


2.4.4 Tencent


2.4.5 Others


  • NetEase


  • JD.com


  • Perfect World


  • 360


  • The three major carriers

The three major carriers (China Telecom, China Unicom, and China Mobile) each have a sizable cloud computing research and standards team, focused mainly on internal private clouds while preparing large public cloud platforms spanning IaaS, PaaS, and SaaS. So far, China Telecom at least has already set up an independent cloud computing business unit or subsidiary.

  • Kingsoft


  • ezCloud


  • Ucloud



3 PaaS


Internationally, the most distinctive offering is Joyent's. Its platform is built entirely on open-source technology and provides an elegant development and deployment environment. Unfortunately, its Solaris heritage, together with the fact that Ruby and Python never captured a larger share of developers, has limited its room to grow. Google App Engine and Microsoft Azure are certainly well known, but GAE has never achieved broad adoption, while Azure's addition of IaaS features and of open-source components from outside the Microsoft ecosystem shows the weakness of its current position.


Domestically, whether it is Sina SAE (which has been around for a while), Aliyun's ACE, or Shanda's Cloud Engine, the customers and market influence they have won are negligible next to the IaaS services. Until SAE's web servers become more independent and Baidu launches virtual machines, these PaaS offerings will struggle to establish a real foothold, even though Sina and Baidu have thrown all their resources behind webmasters building applications on them. Still, I welcome these attempts at new products; innovation and breakthroughs sometimes hide in exactly the technologies and models the mainstream does not yet accept.

4 IaaS Predictions


  • PaaS will not become mainstream within 5 years.

  • Pure PaaS services will evolve into IaaS/PaaS hybrids.

  • Aliyun will face attack and erosion from the crowd of IDC companies in the first half of 2013.

  • Aliyun may face merger or dissolution within three years.

  • HiChina may absorb Aliyun within 3 years.

  • Shanda Cloud faces the danger of dissolution in the second half of 2013.

  • Huawei's cloud service may go live within 3 months.

  • NetEase's cloud will die at launch.

  • Huawei Cloud, Baidu Cloud, Tencent Cloud, and NetEase Cloud will be flashes in the pan.

  • IDC vendors will accelerate their entry into IaaS in 2013.

  • The carriers' public-cloud IaaS services may continue gestating for another year.

  • 21Vianet may partner with Microsoft to launch a public cloud.


Original article by Hantangyue.


[repost ]Virtualization, Cloud Computing, Open Source, and More





Virtualization comes in different layers, for example hardware-level and software-level virtualization. Hardware virtualization means simulating hardware to obtain an environment resembling a real computer, capable of running a complete operating system. Within hardware virtualization there are several implementation approaches: Full Virtualization (an almost complete simulation of a real set of hardware devices; most operating systems run in a fully virtualized environment without any modification), Partial Virtualization (simulation of only the key computing components or instruction sets; operating systems may need some modification to run), and Paravirtualization (no hardware simulation; virtual machines have independent runtime environments and share the underlying hardware resources through a hypervisor; most operating systems must be modified to run). Software-level virtualization usually means providing multiple isolated virtual runtime environments on top of a single operating-system instance, and is often called container technology.


In the virtualization field, Chinese companies are mostly interested in hardware virtualization, and largely adopt hardware virtualization in development and production environments. Taobao was among the earliest in China to study and apply software virtualization; their experience on Taobao's main site shows that replacing Xen with cgroups improves resource utilization. Whether a given scenario should use hardware or software virtualization depends chiefly on whether end users need full control of the operating system (for example, upgrading the kernel version). If end users only need control of the runtime environment (for example, the various App Engine services at the PaaS layer), software virtualization may offer better price/performance. It is also a good choice for scenarios that scale the same application out horizontally.

For engineers who need to understand virtualization technology in depth, VMWare's white paper "Understanding Full Virtualization, Paravirtualization, and Hardware Assist" is an excellent reference.



Early virtualization technology solved the problem of providing multiple mutually independent runtime environments on a single physical machine. When there are few physical machines to manage, a system administrator can log into each machine by hand and manage VM lifecycles (configuration, start, stop, and so on). When there are many, scripts or programs are needed to raise the degree of automation in VM lifecycle management. Software whose purpose is to manage and schedule large numbers of physical and virtual compute resources is called a virtualization management tool. Such a tool lets the system administrator, from a single place: (1) manage VM lifecycles across different physical machines; (2) query and even monitor all physical and virtual machines; and (3) map VM names to VM instances, making VMs easier to identify and manage. VirtManager on Linux is a simple virtualization management tool; in the VMWare product family, VMWare vSphere is a powerful one.



At the datacenter level, system administrators face large numbers of different kinds of hardware and applications. Compared with a small cluster, a datacenter's system complexity is far higher, and simple virtualization management tools no longer meet administrators' needs, so datacenter virtualization management systems evolved on top of them. At the hardware level, these systems reorganize hardware resources into resource pools (a pool is usually a cluster) and expose compute resources to users as a virtual infrastructure. At the software level, they introduce two roles, system administrator and ordinary user, or even finer-grained Role Based Access Control (RBAC) as application scenarios require. The system administrator has management rights over all physical and virtual machines in the datacenter, but generally does not interfere with healthy VMs. Ordinary users can only perform VM lifecycle operations within the resource pools they have rights to, and have no rights over physical machines. In the extreme case, an ordinary user sees only the resource pools assigned to them, with no knowledge of the physical machines composing those pools.
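The administrator/ordinary-user split with per-pool permissions can be sketched in a few lines of Python; the pool names and the permission model here are hypothetical, for illustration only, not taken from any particular product:

```python
# Sketch of the two-role model: admins manage everything; ordinary users
# may only act on VMs inside resource pools granted to them.
class Datacenter:
    def __init__(self):
        self.pools = {}     # pool name -> set of VM names
        self.grants = {}    # user -> set of pool names the user may use

    def add_pool(self, pool, vms):
        self.pools[pool] = set(vms)

    def grant(self, user, pool):
        self.grants.setdefault(user, set()).add(pool)

    def can_manage(self, user, vm, is_admin=False):
        if is_admin:        # system administrator: full datacenter rights
            return True
        # ordinary user: only VMs inside pools granted to this user
        return any(vm in self.pools[p] for p in self.grants.get(user, ()))

dc = Datacenter()
dc.add_pool("web-pool", ["vm1", "vm2"])
dc.add_pool("db-pool", ["vm3"])
dc.grant("alice", "web-pool")
assert dc.can_manage("alice", "vm1")             # inside her pool
assert not dc.can_manage("alice", "vm3")         # other pools are invisible
assert dc.can_manage("root", "vm3", is_admin=True)
```

A production RBAC system would add finer-grained actions (start, stop, snapshot) per role rather than a single yes/no permission.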



Modern datacenter virtualization management systems offer many features that help automate operations, including: (1) rapid deployment of identical or similar runtime environments from templates; (2) monitoring, reporting, alerting, and accounting; and (3) high availability, dynamic load balancing, backup and recovery, and so on. Some of the more open systems even expose open APIs so that system administrators can build extensions fitting their own scenarios and workflows.

In the VMWare product family, VMWare vCenter is datacenter virtualization management software. Other recommendable datacenter virtualization management software includes Convirt, XenServer, Oracle VM, OpenQRM, and so on.


Cloud computing is a further layer of packaging over datacenter virtualization. Cloud management software likewise needs (at least) two roles with different privileges, cloud administrator and ordinary user. The administrator has management rights over all physical and virtual machines in the datacenter, but generally does not interfere with healthy VMs. Ordinary users can manage VM lifecycles themselves through a browser, or write programs that do so automatically through Web Services.

At the cloud computing layer, VM lifecycle management rights are fully delegated to genuinely ordinary users, while concepts such as resource pools and physical machines are hidden from their view. Ordinary users obtain compute resources without needing to know anything about the physical resources behind them. On the surface, cloud computing seems merely to provide compute resources in a model compatible with Amazon EC2/S3. In substance, what has changed is the model of managing compute resources: end users can now obtain and manage compute resources by themselves, without a system administrator's help.

For the cloud administrator, delegating VM lifecycle management to end users does not reduce the workload. On the contrary, it brings thornier problems. In traditional IT architecture, each application usually gets its own set of compute resources, applications are physically isolated from one another, and problem diagnosis is relatively easy. After upgrading to the cloud model, multiple applications may share the same set of compute resources and compete for them, and diagnosis becomes relatively hard. Cloud administrators therefore tend to want cloud management software with reasonably complete datacenter virtualization management features. The crucial ones are: (1) monitoring, reporting, alerting, and accounting; (2) high availability, dynamic load balancing, backup and recovery, and so on; and (3) live migration, which can be used for local load adjustment and fault diagnosis.


In the VMWare product family, VMWare vCloud is cloud management software. Other recommendable cloud management software includes OpenStack, OpenNebula, Eucalyptus, and CloudStack. Although all four are cloud management software, their features differ considerably, and the differences stem from different design philosophies. OpenNebula and CloudStack were originally designed as datacenter virtualization management software, so they have fairly complete datacenter virtualization features. After the cloud computing concept took off, OpenNebula added OCCI and EC2 interfaces, and CloudStack provided an extra component called CloudBridge (included by default since CloudStack 4.0), thereby achieving compatibility with Amazon EC2. Eucalyptus and OpenStack, by contrast, were designed top-down as cloud management software with Amazon EC2 as the prototype; compatibility with Amazon EC2 was considered from the start (OpenStack also added its own extensions), but their datacenter virtualization management features are still somewhat lacking. Of the two, the Eucalyptus project started earlier and is clearly stronger than the OpenStack project in datacenter virtualization management.


The cloud computing described in D is only cloud computing in a narrow sense, that is, cloud computing similar to Amazon EC2. Cloud computing in a broad sense can refer to any practice of accessing physical or virtual computers over a network and using their compute resources, including both the cloud computing described in D and the datacenter virtualization described in C. What the two have in common is that the cloud service provider supplies compute resources to users as virtual machines, and users need not know the actual physical resources behind those VMs. If a cloud platform serves only the inside of one organization, it can be called a "private cloud"; if it serves the public, it can be called a "public cloud". Generally speaking, a private cloud serves different departments (or applications) within an organization and emphasizes flexibility of virtual resource scheduling (for example, end users can specify a VM's processor, memory, and disk configuration), while a public cloud serves the public and emphasizes standardization of virtual resources (for example, the provider offers only a limited number of VM product types, each with a fixed processor, memory, and disk configuration, and end users can only choose the type closest to their needs).
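The public-cloud side of that contrast, a fixed menu of VM types from which the user picks the closest fit, can be made concrete with a toy matcher; the type names and sizes below are illustrative, not any vendor's real catalog:

```python
# A public cloud offers only fixed VM types; the user picks the smallest
# standard type that covers their requirements (memory in GB, CPU cores).
TYPES = {"small": (2, 1), "medium": (4, 2), "large": (8, 4)}

def pick_type(mem_gb, cores):
    # candidate types that satisfy the request, smallest first
    fitting = [(m, c, name) for name, (m, c) in TYPES.items()
               if m >= mem_gb and c >= cores]
    if not fitting:
        raise ValueError("no standard type fits; a private cloud could "
                         "offer a custom configuration instead")
    return min(fitting)[2]

assert pick_type(3, 2) == "medium"   # 3 GB / 2 cores rounds up to medium
assert pick_type(1, 1) == "small"
```

A private cloud, emphasizing flexibility, would skip the menu entirely and provision exactly the requested configuration.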

For a public cloud service provider, the business model resembles Amazon EC2's, so a public cloud provider should normally choose cloud management software as described in D. A private cloud provider should decide based on how compute resources are managed inside the organization. If compute resources are managed centrally and VM lifecycle rights are delegated only down to the level of department managers or team leads, then a datacenter virtualization management system as described in C is the right choice. If VM lifecycle rights are to be delegated to the end users who actually need the compute resources, then cloud management software as described in D should be chosen.

Traditionally, a private cloud was assumed to be built on an enterprise's own datacenter and hardware. But once hardware vendors joined the ranks of cloud service providers, the boundary between private and public clouds grew increasingly blurred. With Rackspace's private cloud service, customers may use their own datacenter and hardware, or rent Rackspace's. Oracle recently went a step further with a private cloud service "Owned by Oracle, Managed by Oracle". Under this new business model, the private cloud a customer enjoys exclusively is merely a resource pool, relatively isolated from other customers, inside the provider's public cloud (you got private cloud in my public cloud). And for the cloud provider, the infrastructure backing its public cloud service may itself be just a resource pool within its own infrastructure (a private cloud), or even within a hardware vendor's infrastructure (you got public cloud in my private cloud).

For the customer, a private cloud service built on the provider's datacenter and hardware makes financial sense. It turns the fixed-asset capital expenditure (CapEx) of building a datacenter and purchasing hardware into operating expenses (OpEx) paid in installments, and the precious cash can serve as working capital for expanding the business. Even if owning such a private cloud costs more over the long run than building your own datacenter and buying your own hardware, the returns from deploying the freed-up cash into business expansion may exceed the cost difference between the two options. In the extreme case, even if the enterprise ultimately fails, there is no pile of newly purchased hardware to mourn. Short of a sudden upswing in the real-estate market, a company on the verge of collapse rarely regrets not having built its own datacenter. (It should be noted that for a company that operates long enough, profiting from real estate is entirely possible: before Sun was acquired by Oracle, it once turned its financial report from loss to profit by selling off ancestral property.)


In 1865 the English economist William Jevons (1835-1882) wrote a book called The Coal Question. Jevons described a seemingly self-contradictory phenomenon: progress in steam-engine efficiency raised coal's energy conversion rate, the higher conversion rate lowered energy prices, and the lower prices in turn increased coal consumption. This phenomenon is known as the Jevons paradox. Its core idea is that improvements in resource efficiency lower prices and ultimately increase the amount of the resource consumed. Over the past 150 years the Jevons paradox has been borne out empirically in major industrial raw materials, transportation, energy, the food industry, and many other fields.

The core value of public cloud computing services is turning servers, storage, networking, and other hardware from self-purchased fixed assets into metered public resources. Virtualization technology raises the utilization of compute resources, which lowers their price, which ultimately increases their consumption. Once you grasp this logic, you can see why HP decisively joined the OpenStack camp and launched an OpenStack-based public cloud service before OpenStack had even matured. Doing cloud computing may not save a tottering HP, but without cloud computing HP's days would likely be numbered. The same logic explains why Oracle transformed itself from sneering at cloud computing into a cloud computing practitioner. After acquiring Sun, Oracle became a world-leading hardware vendor overnight. Its dismissive attitude back when the cloud computing concept was just emerging showed that it had not yet adjusted to its changed position. Now that cloud computing has moved from concept hype into live exercises, if Oracle, as one of the major hardware vendors, did not intend to claim its share of the cloud, that would be a truly slow reflex.


Currently, most public cloud providers price their virtual machine products by configuration. Take Amazon EC2 as an example: its Medium instance (3.75 GB memory, 2 ECU compute units, 410 GB storage, $0.16 per hour) has twice the configuration of its Small instance (1.7 GB memory, 1 ECU compute unit, 160 GB storage, $0.08 per hour), and twice the price. The recently launched HP Cloud Services, as well as Shanda Cloud and Aliyun in China, essentially copy Amazon EC2's pricing method. The problem is that when a VM's configuration is doubled, its performance is not doubled. A series of performance benchmarks against Amazon EC2, HP Cloud Services, Shanda Cloud, and Aliyun shows that for many types of application, price/performance actually declines steadily as VM configuration grows. Such a pricing strategy clearly cannot achieve the goal of encouraging users to consume more compute resources.
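The arithmetic behind that observation is simple. Using the EC2 Small and Medium prices quoted above, and assuming (purely for illustration; not a real benchmark result) that the Medium instance delivers only 1.6x the Small instance's throughput on some workload:

```python
# Cost per unit of delivered performance: if doubling the configuration
# doubles the price but less than doubles the throughput, the bigger
# instance is a worse deal per unit of work.
small = {"price": 0.08, "throughput": 1.0}   # baseline: Small instance
medium = {"price": 0.16, "throughput": 1.6}  # 1.6x is an assumed figure

def cost_per_unit(vm):
    return vm["price"] / vm["throughput"]

assert cost_per_unit(small) == 0.08
assert round(cost_per_unit(medium), 3) == 0.10
```

Under that assumption each unit of work costs 25 percent more on the Medium instance, which is exactly the declining price/performance the benchmarks describe.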



In recent years we have observed a pattern in the information technology field: when a closed-source solution succeeds in the market, one or more open- or closed-source followers offering similar functionality (or services) soon appear. (The reverse case, open-source software first and competing closed-source software later, is rare.) In operating systems, Linux gradually reached and even surpassed Unix's technical level and went on to displace Unix's market position. In virtualization, Xen and KVM closely tracked VMWare's technical development, broke through in places, and are steadily eating into VMWare's market share. In cloud computing, Enomaly first launched a closed-source solution modeled on Amazon EC2, followed closely by open-source solutions represented by Eucalyptus and OpenStack. Meanwhile, traditionally closed-source vendors are also changing their attitude toward open-source projects and communities. For example, Microsoft, hostile to open-source projects for many years, this April formed a subsidiary called Microsoft Open Technologies, whose goal is to advance Microsoft's investment in openness, including interoperability, open standards, and open-source software.

The business environment we live in today differs considerably from the early days of the Free Software Movement in the 1980s. Since Netscape first coined the term Open Source in 1998, open source has become a new model for developing, promoting, and selling software, no longer merely an alternative standing in opposition to commercial software. Compared with the traditional closed-source business model, the open-source business model has the following characteristics:



(3) In the harvest phase of a project, the project initiator and its key partners can earn financial returns by selling enhanced editions or providing solutions. Other vendors can also offer similar products or services, but the main participants in an open-source project usually carry more weight and authority in the market. On the question of profiting from open-source projects, Marten Mickos (CEO of Eucalyptus) observed while serving as CEO of MySQL: "To succeed in open source you must serve (A) people who will spend time to save money, and (B) people who will spend money to save time." If a company can be said to have succeeded at open source, then its returns from selling and servicing the open-source software should at least exceed its investment in development and promotion. Evidently, some users get to use open-source software free of charge partly because their participation lowers the software's development and promotion costs, and partly because paying users have paid more for it.



Open source, as a new business model, is not morally superior to the traditional closed-source model, and by the same token it is inappropriate to judge different open-source practices on moral grounds. In the OpenStack project's embryonic stage, Rackspace's marketing copy called OpenStack "the world's only truly open-source IaaS system". CloudStack, Eucalyptus, and OpenNebula, open-source projects with similar functionality, were classed by Rackspace as "not truly open-source projects" because they retained partially closed-source enterprise editions (before April 2012, the CloudStack project and Eucalyptus each released both a fully open-source community edition and a partially closed-source enterprise edition; after April 2012, Eucalyptus announced it was fully open source, and CloudStack, acquired by Citrix and donated to the Apache Foundation, also went fully open source) or offered automated installation packages only to paying customers (OpenNebula Pro is an automated installation package with enhanced features, but all of its components are open source). This line of marketing lasted nearly two years, until Rackspace released the OpenStack-based Rackspace Private Cloud software, an automated package similar in nature to OpenNebula Pro. OpenNebula Pro is a package available only to paying users, whereas any user can download and use the Rackspace Private Cloud software free of charge. The catch is that when the number of nodes under management exceeds 20 servers, the user must turn to Rackspace for help (that is, purchase the necessary technical support). Let us set aside for now whether the code enforcing the 20-server limit is itself open source. That a project's initiator and main contributor adds, to its own repackaged distribution, a feature restricting the software's scope of use is hard to justify at the moral level but perfectly normal at the commercial level. Over the past two years, the measures the OpenStack project has taken in development, promotion, community building, and other areas are all textbook examples of the open-source business model.

As noted earlier, the same field often hosts several competing open-source projects. Take cloud computing in the broad sense: besides the familiar CloudStack, Eucalyptus, OpenNebula, and OpenStack, there are many other choices such as Convirt, XenServer, Oracle VM, and OpenQRM. Given a particular application scenario, how do you choose among so many open-source options? In my personal experience, the selection process can be divided into three phases: requirements analysis, technical analysis, and business analysis.

(1) In the requirements analysis phase, dig deeply into the real purpose of adopting cloud computing for the specific application scenario. In China, many project decision makers' understanding of cloud computing stops at "raise resource utilization, cut operating costs, offer more convenience", without realizing that this list is already baseline functionality most open-source software can provide. Beyond that, many decision makers take the full feature set of VMWare vCenter as the default requirement for open-source software, without considering whether the specific project needs those features at all. It is therefore essential to investigate the specific application scenario, classify it explicitly as datacenter virtualization or cloud computing in the narrow sense, and dig further into the project's concrete functional requirements. In many cases, either datacenter virtualization or narrow-sense cloud computing can satisfy the customer's overall needs, and the salesperson's task becomes steering the customer's concrete requirements in a direction favorable to themselves, a technique known as expectation management. Requirements analysis, by pinning down the scenario's classification, filters out a portion of the options.

(2) In the technical analysis phase, first compare the reference architectures of the various open-source packages, focusing on the difficulties of implementing each reference architecture in the specific application scenario. Then compare the packages at the feature level, treating must-have features differently from good-to-have ones. Beyond that, you can evaluate ease of installation and configuration, usability of specific features, completeness of reference documentation, feasibility of custom development, and so on. Technical analysis yields a scored ranking of the packages, on the basis of which the lowest-scoring options can be eliminated.
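The must-have versus good-to-have split maps naturally onto a gated, weighted score. A sketch, with made-up feature names and weights:

```python
# Must-have features act as a gate (missing one disqualifies the option);
# good-to-have features contribute a weighted score for ranking survivors.
def score(option, must_have, good_to_have):
    if not must_have.issubset(option["features"]):
        return None                      # disqualified outright
    return sum(w for f, w in good_to_have.items() if f in option["features"])

must = {"ec2_api", "rbac"}
nice = {"live_migration": 3, "templates": 2, "open_api": 1}

options = [
    {"name": "A", "features": {"ec2_api", "rbac", "templates"}},
    {"name": "B", "features": {"ec2_api", "live_migration"}},  # lacks rbac
]
ranked = [(score(o, must, nice), o["name"]) for o in options]
assert ranked[0] == (2, "A")     # qualifies; scores the templates weight
assert ranked[1][0] is None      # B fails the must-have gate
```

The weights themselves come out of the requirements analysis in step (1); the scoring machinery only makes the trade-offs explicit and repeatable.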


In China's (narrow-sense) cloud computing market, CloudStack and Eucalyptus deserve first consideration from customers willing to pay. Both projects started relatively early, offer better stability and reliability, carry considerable weight in the industry, and have domestic teams able to provide support and service. Meanwhile, some domestic startup teams have begun offering OpenStack-based solutions, but they can hardly accumulate the necessary field experience in a short time, and the richly experienced Sina SAE team has not yet opened up an external technical support business. A few organizations in China do use OpenNebula, but none will be able to provide technical service to third parties any time soon. For customers willing to spend time instead of money, CloudStack and OpenStack have the clearer advantage, because those two communities are the most active. Of the two, CloudStack has richer features and more enterprise customers and success stories, and may be the better choice in the short term. In the long run, OpenStack-based solutions will become more and more popular, but the other solutions keep making technical and market progress too, so it will be hard for any one of them to unify the market within the next three years. On purely commercial grounds, CloudStack and Eucalyptus may have the better odds of success.



For a comparison of community activity across the different open-source projects, see my recent blog post "CY12-Q3 Community Activity Comparison: OpenStack, OpenNebula, Eucalyptus, CloudStack". In "HP Cloud Services Performance Benchmarks" I also proposed a preliminary method for benchmarking public clouds.


[repost ]Startups Are Creating A New System Of The World For IT


It remains that, from the same principles, I now demonstrate the frame of the System of the World. — Isaac Newton

The practice of IT reminds me a lot of the practice of science before Isaac Newton. Aristotelianism was dead, but there was nothing to replace it. Then Newton came along, created a scientific revolution with his System of the World. And everything changed. That was New System of the World number one.

New System of the World number two was written about by the incomparable Neal Stephenson in his incredible Baroque Cycle series. It explores the singular creation of a new way of organizing society grounded in new modes of thought in business, religion, politics, and science. Our modern world emerged, as Enlightened as it could be, from this roiling cauldron of forces.

In IT we may have had a Leonardo da Vinci or even a Galileo, but we’ve never had our Newton. Maybe we don’t need a towering genius to make everything clear? For years startups, like the frenetically inventive age of the 17th and 18th centuries, have been creating a New System of the World for IT from a mix of ideas that many thought crazy at first, but have turned out to be the founding principles underlying our modern world of IT.

If you haven’t guessed it yet, I’m going to make the case that the New System of the World for IT is that much over hyped word: cloud. I hope to show, using many real examples from real startups, that the cloud is built on a powerful system of ideas and technologies that make it a superior model for delivering IT.

IT has had an explosion of creativity: open source, deep and powerful tool chains, lean and agile development, cloud computing, virtualization, BigData, parallel programming, distributed monitoring, distributed programming, NoSQL, cost driven programming, dynamic languages, real-time processing, asynchronous programming, distributed teams, mobile platforms, viral loops, flat networks, software defined networking, wimpy cores, DevOps, everything as a service, infrastructure as code, and so on and so on. Astounding innovation wherever you look.

We are just now figuring out what new structures and systems are replacing the old, but if you step back a bit, what seems to be happening is that we are creating a new "frame," using a bottom-up methodology, that just may be a new System of the World for IT. What is emerging is a new way of working, synthesized from all the diverse forces catalogued above. We've created a sort of new physics of development in place of a collection of prescientific alchemical lore.

Since it is startups tackling problems that can't be solved using traditional methods, it is through them that we'll explore this new System of the World for IT.

It’s Not All About The Cloud, But It’s Mostly About The Cloud

These days the story of startups primarily revolves around the cloud in one way or another. Not completely, not totally, but usually. That's my inescapable observation based on all the architecture profiles I've written; most involve the cloud.

Not all startups choose the cloud, many do not, but even if a startup doesn’t join a formal cloud, we still see the development of cloud-like infrastructures and the deployment of cloud inspired tool chains. So we’ll just skip all the old arguments about OpEx vs CapEx, IaaS vs PaaS vs SaaS, virtualization vs bare metal, public vs private vs hybrid clouds, and open vs closed clouds. Those are all just business decisions made in the pursuit of business goals.

Which specific choices are made isn't all that important, which is why I'll use the term cloud in a generic sense. By cloud I do not mean any particular cloud provider or technology. Zynga, for example, used Amazon extensively; now they've built their own cloud to have more control, use fewer servers, and save money. But what they built is still a cloud.

There is a line of controversy worth pursuing that goes something like this: the cloud is no different than what we have been doing in datacenters for years, so what’s the big deal? The cloud is certainly a systematization and productization of capabilities traditionally found in a well staffed datacenter. So in that way the cloud is nothing new.

The key differentiators between a cloud and a datacenter are often said to be multitenancy, geographical distribution, and elasticity. I want to say the key difference between a cloud and a datacenter is democratization. Where once only a few companies could leverage advanced datacenter services, now everyone, great and small, can exploit the same capabilities. What was once private is now public. What was once specialized is now generic. What was once scarce is now abundant. Programmers jumped on all these new capabilities and turned them into the most sophisticated ecosystem for IT that we've ever seen. That's a big deal.

So it is in cloud inspired features that a New System of the World can be found, not any particular instance of the cloud.

The Old Datacenter Versus The New Cloud

The quickest way I can think of to illustrate what the New System of the World for IT looks like is to consider the innovative work Netflix is doing in replacing their “in-house IT with the cloud for non-trivial applications with hundreds of developers and thousands of systems.”

Netflix is the poster child for moving from the datacenter to the cloud because they've actually done it. Netflix ran their own datacenter and are now 100% cloud. Along the way they've done a lot of original thinking about what it means to run an IT-centric business in the cloud. Adrian Cockcroft, a Cloud Architect at Netflix, has created an amazing Cloud Architecture Tutorial documenting what they've learned.

What follows is a list of some major transitions Netflix has made in going from the datacenter to the cloud. The list is a synthesis of slides in the tutorial. It paints a clear picture of how IT in the cloud is different than IT in the datacenter:

Old Datacenter → New Cloud

Licensed and Installed Applications → SaaS (Workday, Pagerduty, EMR)
Central SQL Database → Distributed Key/Value NoSQL
Sticky In-Memory Session → Shared Memory Cache Session
Tangled Service Interfaces → Layered Service Interfaces
Instrumented Code → Instrumented Service Patterns
Fat Complex Objects → Lightweight Serialized Objects
Components as Jar Files → Components as Services
Chatty Protocols → Latency Tolerant Protocols
Manual and Static Tools → Automated and Scalable Tools
SA/Database/Storage/Networking Admins → NoOps/OpsDoneMaturelyButStillOps
Monolithic Software Development → Teams Organized around Services
Monolithic Applications → Building Your Own PaaS
Static and Slow Growing Capacity → Incremental and Fast Growing Capacity
Heavy Process/Meetings/Tickets/Waiting → Better Business Agility
Single Location → Massive Geographical Distribution
Vendor Supply Chains → Direct to Developer
Focus on How Much it Costs → Focus on How Much Value it Brings
Ownership/CapEx → Leasing/OpEx/Spot/Reserved/On Demand


Some principles we see at work are a move to distributed architectures, a focus on generating business value through agility and flexibility, a move away from ownership as a core competency, a separation of concerns along services boundaries, a decentralization and reorganization of processes around services, and a push of responsibility to as close to the developer as possible.

We’ll explore some of these ideas in later sections, but I think this makes it clear we aren’t just talking business as usual, when taken altogether we are talking about something new. It’s a complete transformation at every level.

If you want to say we can do all this in the datacenter I can’t argue, because clouds are built on datacenters. Though I would argue, that once a datacenter can do all these things, it has become a cloud.

The IT World Is Now Flat

Although the New System of the World was pioneered by startups, what has developed, strangely enough, serves to make any enterprise development group just as agile as any startup. The IT world has become flat. There’s now a level playing field across all of IT. The cloud has changed the core economic concepts of delivering business value on top of IT.

A small team in any company can recognize an opportunity, create a product within a week, have it run in many different locations worldwide, with almost no startup capital, and with a low sysadmin burden. Idea to innovation in the time it would have previously taken to work up a hardware request budget proposal.

For some time we’ve had practices like: agile development, extreme automation, short development iterations, continuous integration, continuous deployment, continuous testing, small dedicated teams, and so on. These practices, although much talked about, were seldom implemented.  What slowed adoption was a missing element: the cloud’s programmable IT fabric.

Previously a complex and highly specialized stack was required to follow the agile path. Now it’s easy for any group to develop software this way. And we’ve seen startup after startup adopt these strategies, creating a total revolution in practice on everything about how software is created, distributed, and maintained.

One reason for this revolution is explained by Etsy in terms of Conway’s Law:

When a team makes a product the product ends up resembling the team that made it.

I'll extend this notion to say the team, and thus the product, end up resembling the underlying technology used to make it. When you change the underlying development infrastructure by moving to a cloud, you are bound to change the teams and the processes they create.

Here are a few examples from startups of how pretty much everything has changed:

  • Instagram: Give me a place to stand and with a lever I will move the whole world. An organization with 2 backend engineers can now scale a system to 30+ million users and be bought for one billion dollars. Regardless of your opinion on the purchase price, the ability of a small organization to handle such a huge user base is an unprecedented amount of leverage.
  • Fidelity: Fidelity is not a startup, but they are creating a next-generation internal cloud, saying that the cloud and BigData are creating new rules for IT organizations to innovate. No longer will they be hampered by the organization.
  • Netflix: There's virtually no process at Netflix. They don't believe in it. They don't like to enforce anything. It slows progress and stunts innovation. They want high-velocity development. Each team can do what they want and release whenever they want, however often they want. Teams release software all the time, independent of each other. They call this an "optimistic" approach to development.
  • Netflix: NoOps. "We have hundreds of developers using NoOps to get their code and datastores deployed in our PaaS and to get notified directly when something goes wrong. We have built tooling that removes many of the operations tasks completely from the developer, and which makes the remaining tasks quick and self service. There is no ops organization involved in running our cloud, no need for the developers to interact with ops people to get things done, and less time spent actually doing ops tasks than developers would spend explaining what needed to be done to someone else."
  • Etsy: Continuous deployment. Any engineer at Etsy can deploy the whole site to production at any time. It happens 25 times a day because it's so easy; it's a one-button deploy. Small change sets are going out all the time, not large deployments. If things go wrong they can quickly figure out what went wrong and fix it. Compare this to the infrequent big-bang software updates that are typical.
  • Etsy: QA is performed by developers. Developers make production changes themselves. This brings them closer to production, which enables an operability mindset, as opposed to a ship-to-QA-and-consider-it-done mindset. Developers deploying their own code also brings accountability, responsibility, and the requisite authority to influence production. No operations engineer stands between a development engineer and deployment.
  • Facebook: Small, independent teams with both responsibility and control. Small teams allow work to be done efficiently, quickly, and carefully. Only three people work on photos, for example, the largest photo site on the Internet. But responsibility requires control: if a team is responsible for something they must control it. For example, Facebook pushes code into production every day, and the person who wrote the code is there to fix anything that goes wrong. If the responsibilities of writing and pushing code are split, the code writer doesn't feel the effect of code that breaks the system. Compare this to the typical separation of developers, QA, and DevOps.
  • Facebook: Move fast. At every level of scale there are surprises. Surprises are quickly dealt with by a highly qualified, cross-disciplinary team that is flexible and skilled enough to deal with anything that comes up. Flexibility is more important than any individual technical decision. By moving fast Facebook is also able to try more options and figure out which ones work best. Compare this to the typically heavyweight planning and development processes.
  • TripAdvisor: No architects; engineers work across the entire stack. You own your project end to end, and are responsible for design, coding, testing, and monitoring. Most projects are 1-2 engineers. If you do not know something, you learn it. The only thing that gets in the way of delivering your project is you, as you are expected to work at all levels. Compare this to the islands of specialization that are typical in IT.
  • Amazon: You build it, you run it. Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, throw it over, and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

This is not how software development has been done in the past. What makes it possible is the leverage gained by an IT programming fabric that treats a datacenter and its contained services as being software scriptable. From this base very powerful tool chains like Rightscale, Chef, Puppet, and dozens of others have been developed to make it possible for small teams to quickly do a lot with a little.
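The "software scriptable" datacenter can be reduced to one core idea: desired infrastructure is declared as data, and a loop converges reality toward it. A stripped-down sketch of that loop (nothing here is any real tool's API):

```python
# Declarative convergence loop: compare desired state to actual state and
# emit only the actions needed to close the gap -- the core idea behind
# configuration-management and cloud-provisioning tool chains.
def converge(desired, actual):
    actions = []
    for role, count in desired.items():
        have = actual.get(role, 0)
        if have < count:
            actions.append(("launch", role, count - have))
        elif have > count:
            actions.append(("terminate", role, have - count))
    return actions

desired = {"web": 4, "db": 2}    # infrastructure as code: just data
actual = {"web": 2, "db": 3}     # what is currently running
assert converge(desired, actual) == [("launch", "web", 2),
                                     ("terminate", "db", 1)]
```

Tools like Chef and Puppet elaborate this same compare-and-converge loop with real resources (packages, files, services, cloud instances) in place of the toy server counts.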

Reducing The Mean Time Between Big Ideas

The most startling change post-cloud is the increasing pace of innovation. Netflix sees the cloud as a laboratory for reducing the mean time between big ideas. James Urquhart says the cloud is a lottery system for developers. Developers can implement something quickly and cheaply, hope it gets 10 million users, hope it succeeds big, and if it doesn't, it wasn't that expensive to fail.

The software startup landscape itself has been changed forever. Previously you would go to a VC for the many millions needed to even begin an idea, now you are expected to have a prototype ready before even seeing a VC.

Here are a few examples of how startups are making use of these new capabilities to innovate:

  • Robert Scoble talks about how the flood of new startups is just starting. Startups are starting all over the world, not just Silicon Valley or New York. Now you can start a startup in the middle of nowhere in India. The costs of starting a startup have gone way down. Y Combinator used to graduate just 10 companies a class; now they have 60. And each month there are more and more incubators. Two kids can start Instagram and they can start it anywhere.
  • TripAdvisor: Engineering can be best compared to running two dozen simultaneous startups, all working on the same code base and running in a common distributed computing environment. Each of these teams has their own business objectives, and each team is able to, and responsible for, all aspects of their business. Each of the teams operates in the way that best fits their distinct business and personal needs, this process is best described as “post agile/scrum”.
  • Netflix: Runs in Amazon so they can innovate and not have to worry about growth in the future.
  • Netflix: “We built a completely cloud based infrastructure in the US and did some work extracting it so we could actually deploy it anywhere. We set up a bunch of test machines in the AWS Ireland facility and we built the ability to replicate data across both sites. In total we set up 1,000 machines in Ireland. If we had built our own data centre then we would have had to lay down a large amount of money in, say, six months in advance for a really efficient build out, and instead we could use that money to buy movies.”
  • Playfish: The cloud allows Playfish to innovate and try new features and new game with very low friction, which is key in a fast moving market. The cloud allows them to concentrate on what makes them special, not building and managing servers.
  • Zynga: Zynga uses the cloud to deploy their applications and prove them out while handling the load during the process. They then fold applications back into their datacenter once the growth trajectory has been established. It’s not about saving money, it’s about growing business.
  • Steve Lacy: Amazon’s EC2 is a better ecosystem for fast iteration and innovation than Google’s internal cluster management system.  EC2 gives me reliability, and an easy way to start and stop entire services, not just individual jobs.

Typically a datacenter is a lock, a point of serialization for developers that creates a vertical barrier through the entire stack. By unshackling developers from IT infrastructure people, the cloud opens up the possibility space: developers can do new things they could never do before.

Of course, the distributed infrastructure of the Internet is essential to the low-friction creation and distribution of ideas and to the building of teams and sharing of code. And the web and mobile are far more fertile niches for startups than any enterprise landscape. Yet the cloud, by creating an elastic usage model for all the services developers need, has unshackled developers. Developers can now be sure everything will just work without first having to ask permission. The entire cycle is now developer driven, which has thrown an accelerant on the fire of innovation.

It’s Open Source All The Way Down

The foundations for this New System of the World sit squarely on Open Source software. There is virtually no startup you can name that is not built primarily on Open Source. Take a look at Tumblr’s stack as a quick example: Linux, Apache, PHP, Scala, Ruby, Redis, HBase, MySQL, Varnish, HA-Proxy, nginx, Memcache, Gearman, Kafka, Kestrel, Finagle, Thrift, HTTP, Func, Git, Capistrano, Puppet, and Jenkins.

It’s all open source and Tumblr is by no means unique, this is a common pattern.

Open Source started with small libraries and has moved up stack with ever larger and more sophisticated components, applications, tools, languages, and operating systems. Now we are seeing movement into Open Source hardware, networking, and even Open Source clouds. At one time this was not true. At one time most software was developed with closed source tool chains. That has completely changed.

While Open Source was firmly established in the programming tools arena, LiveJournal was probably the earliest example of creating and open sourcing more sophisticated infrastructure tools like memcached and MogileFS. And possibly even more important was that they took the time to talk about the architecture challenges they faced and how they solved them. LiveJournal was the prototype for the early web.

This attitude helped create a virtuous circle in the development community, spawning a tradition that has continuously become more generous and more productive over time. Major companies like Netflix, Twitter, LinkedIn, Google, and Facebook are not only first to tackle scaling challenges, but they Open Source many of the solutions. And more importantly, they share their experiences and lessons learned with the whole community.

The impact of Open Source on productivity and innovation has been transformative. The advantage Open Source gives you is time. You can do more in less time. If you want to plug into this productivity cycle then you need to align yourself with the Open Source ecosystem. It’s not just for startups, it’s for anyone developing products. Use closed source where it offers a competitive advantage, but the fastest innovation is happening in the Open Source community and that’s with whom you want to make alliances.

It’s Loosely Coupled Services All The Way Down

If Open Source is the foundation for the New System of the World then Service Oriented Architectures are the load bearing walls. As we’ll see, services are not just a software architecture feature anymore, but they’ve become the organizing principle around how teams and software are constructed.

Services have been around forever. Client-server programming was invented as a way for applications to take advantage of networks of computers. This idea was lost on early web architectures that stuffed everything into two or three tier architectures. A browser talked to a web server that invoked code that would return a web page. That code might talk to a database, but it was always a monolithic self-contained blob. As web sites needed to scale, programmers rediscovered client-server programming and started breaking down monolithic applications into cooperating collections of services. Services started talking to other services and soon web servers weren’t application servers anymore, but just a thin layer around a set of service calls. The dependence of rich UIs and mobile applications on backend services has simply continued this evolution.
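A hedged sketch of what "a thin layer around a set of service calls" looks like in practice. The service functions and their return shapes here are invented for illustration; in a real system each would be a network call to an independently deployed service.

```python
# Toy sketch: the web tier as a thin aggregation layer over services.
# user_service and feed_service are hypothetical stand-ins for what
# would be remote service calls in a real deployment.

def user_service(user_id):
    # Pretend remote call to the user service.
    return {"id": user_id, "name": "alice"}

def feed_service(user_id):
    # Pretend remote call to the feed service.
    return [{"post": "hello"}, {"post": "world"}]

def render_home_page(user_id):
    # The "web server" holds no business logic of its own; it just
    # fans out to backend services and assembles the result.
    user = user_service(user_id)
    feed = feed_service(user_id)
    return {"user": user["name"], "items": len(feed)}

print(render_home_page(42))  # {'user': 'alice', 'items': 2}
```

The same aggregation layer serves the browser, the mobile app, and the public API, which is exactly why treating the web application differently from native clients stops making sense.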

Here’s a how a number of startups are using Service Oriented Architectures:

  • Wordnik: “We’ve made a significant architectural shift. We have split our application stack into something called Micro Services. The idea is that you can scale your software, deployment and team better by having smaller, more focused units of software. The idea is simple — take the library (jar) analogy and push it to the nth degree. If you consider your “distributable” software artifact to be a server, you can better manage the reliability, testability, deployability of it, as well as produce an environment where the performance of any one portion of the stack can be understood and isolated from the rest of the system. Now the question of “whose pager should ring” when there’s an outage is easily answered! The owner of the service, of course.”
  • Playfish: Service Oriented Architectures are used at Playfish to manage complexity. As new games are added code is split into different components that are managed by different teams. This helps keep the overall complexity of the system down, which helps make everything easier to scale.
  • Amazon:
    • The big architectural change that Amazon made was to move from a two-tier monolith to a fully-distributed, decentralized, services platform serving many different applications. Their architecture is loosely coupled and built around services. A service-oriented architecture gave them the isolation that would allow building many software components rapidly and independently. Grew into hundreds of services and a number of application servers that aggregate the information from the services.
    • Services are the independent units delivering functionality within Amazon. It’s also how Amazon is organized internally in terms of teams. If you have a new business idea or problem you want to solve you form a team. Limit the team to 8-10 people because communication is hard. They are called two pizza teams: the number of people you can feed off two pizzas. Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit.
  • Netflix: “If you think about infrastructure as a service and platform as a service (PaaS), what we’ve built is a PaaS over the top of the AWS infrastructure, which is as thin a layer as we could build, leveraging as many Amazon features as seemed interesting and useful. Then we put a thin layer over that to isolate our developers from it.”
  • Netflix: Their architecture is service based. Many small teams of 3-5 person teams are completely responsible for their service: development, support, deployment. They are on the pager if things go wrong so they have every incentive to get it right. They’ve built a decoupled system where every service is capable of withstanding the failure of every service it depends on. Everyone is sitting in the middle of a bunch of supplier and consumer relationships and every team is responsible for knowing what those relationships are and managing them. It’s completely devolved — they don’t have any centralised control. They can’t provide an architecture diagram, it has too many boxes and arrows. There are literally hundreds of services running.
  • Facebook: Each layer is connected via well defined interface that is the sole entry point for accessing that service. This prevents nasty complicated interdependencies. Clients hide behind an application API. Applications use a data access layer. Application logic is encapsulated in application servers that provide an API endpoint. Application logic is implemented in terms of other services. The application server tier also hides a write-through cache as this is the only place user data is written or retrieved, it is the perfect spot for a cache.
  • Tumblr: Built a kind of Rails scaffolding, but for services. A template is used to bootstrap services internally. All services look identical from an operations perspective. Checking statistics, monitoring, starting and stopping all work the same way for all services.
  • “The shift first started with the ascendancy of native mobile apps. Now, developers had to seriously start considering their HTTP APIs as first-class citizens and not nice-to-haves. Once that happened, it’s not a big leap to realize that treating your web application as somehow different from any of your native clients is a bit, well, insane.”

Now everything is kind of like it was before: service based, message passing based, distributed, real-time, queue based, and completely asynchronous. The tools to accomplish all this are different of course, but in principle they are similar.

What’s radically different from the past is the unification of services by rearchitecting entire products as a PaaS. This is made possible by a suite of scalable services linked together using a distributed IT fabric. Architectures can now be elastic and adaptive in ways that are still being explored.

Lifecycle Of A Project: Public Cloud To Private Cloud — Or Vice Versa — Or Both

New in this New System of the World is the idea of federated compute spaces between which application functionality can flow, depending on business objectives.

Zynga is the most famous practitioner of this form of cloud thermodynamics. Zynga used the public Amazon cloud to deploy their applications, prove them out, and handle load during the initial phases of the release process. Then, once the growth trajectory had been established, they folded the application back into their own datacenter.

It wasn’t an architecture decision based on saving money, it was about growing the business. Zynga has matured and is now moving off Amazon into their own private cloud, in search of lower costs and better performance, but they’ve created an enduring architectural pattern that will work for anyone.

The ability for a business to target business goals with this degree of risk management flexibility was virtually impossible in the rack’em and stack’em age.

Cost Driven Architectures

In the New System of the World how applications are architected has changed forever with the introduction of pay for use models like SaaS, PaaS, and IaaS.

Historically in programming the costs we talk about are time, space, latency, bandwidth, storage, person hours, etc. Infrastructure costs have been part of the capital budget. Someone ponies up for the hardware and software is then “free” until more infrastructure is needed. The dollar cost of software design isn’t usually an explicit factor considered.

Now software design decisions are part of the operations budget. Every algorithm decision you make will have a dollar cost associated with it, and it may become more important to craft algorithms that minimize operations cost across a large number of resources (CPU, disk, bandwidth, etc.) than to trade off our old friends space and time.

Different resource costs will force very different design decisions. On Amazon do you use a spot instance, a reserved instance, or an on-demand instance? Do you need a small or extra large or one of another dozen instance choices? Do you need to span multiple regions, or is working across multiple availability zones acceptable? Should you build your own or use a built-in SaaS? Should you risk lock-in and use more of the built-in services, or try to stay as independent as possible?

Just a few short years ago these were all issues you would never have considered. A phase change has happened in architecture. Even if you aren’t in a public cloud it’s likely you’ll conceptualize your architecture in this way, because that’s how the infrastructure tools will be patterned.
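A back-of-envelope sketch of how such questions become arithmetic. All rates below are made-up placeholders, not actual AWS prices, and the spot model is deliberately crude (a flat utilization discount standing in for interruption risk).

```python
# Illustrative cost comparison; every number here is a placeholder,
# not a real cloud price.

HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate, utilization=1.0, upfront_monthly=0.0):
    """Dollar cost per month for one instance under a pricing model."""
    return hourly_rate * HOURS_PER_MONTH * utilization + upfront_monthly

on_demand = monthly_cost(0.10)
# Reserved: lower hourly rate plus an amortized upfront payment.
reserved = monthly_cost(0.04, upfront_monthly=20.0)
# Spot: cheapest rate, but assume it is only available 80% of the time.
spot = monthly_cost(0.03, utilization=0.8)

print(f"on-demand: ${on_demand:.2f}")
print(f"reserved:  ${reserved:.2f}")
print(f"spot:      ${spot:.2f}")
```

Even this crude model makes the design pressure visible: an algorithm that tolerates interruption can ride the cheapest tier, while one that cannot must pay for reliability.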

Flow Architectures – The Firehose

One of the consequences of using a Service Oriented Architecture is a lot of messages need to be targeted to a lot of different endpoints. And because in the cloud you aren’t standing up a few servers and nailing down connections between them anymore, you need a robust message bus to connect everything together.

The solution that has evolved is the Firehose. A firehose is a message bus that can handle elastic components, message queueing, fault isolation, asynchronous processing, low-latency communication, and operation at high scale.

Here are a few examples of startups using firehose architectures:

  • Tumblr:  Internally applications need access to the activity stream of information about users creating/deleting posts, liking/unliking posts, etc.  A challenge is to distribute so much data in real-time. An internal firehose was created as a message bus. Services and applications talk to the firehose via Thrift. LinkedIn’s Kafka is used to store messages. Internally consumers use an HTTP stream to read from the firehose. The firehose model is very flexible, not like Twitter’s firehose in which data is assumed to be lost. The firehose stream can be rewound in time and it retains a week of data. On connection it’s possible to specify the point in time to start reading. Multiple clients can connect and each client won’t see duplicate data. Each consumer in a consumer group gets its own messages and won’t see duplicates.
  • DataSift: Created an Internet scale filtering system that can quickly evaluate very large filters. It is essentially a giant firehose. 0mq is used for replication, message broadcasting, and round-robin workload distribution. Kafka (LinkedIN’s persistent and distributed message queue) is used for high-performance persistent queues.
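The rewindable-firehose behavior described for Tumblr can be sketched as an append-only log plus per-consumer offsets, so each consumer sees every message exactly once and can rewind to any point. This in-memory toy is our own simplification; real systems like Kafka persist, partition, and replicate the log.

```python
# Minimal in-memory sketch of a rewindable firehose: an append-only
# log plus a per-consumer read offset. Not production code.

class Firehose:
    def __init__(self):
        self.log = []       # append-only message log
        self.offsets = {}   # consumer name -> next index to read

    def publish(self, message):
        self.log.append(message)

    def consume(self, consumer, batch=10):
        start = self.offsets.get(consumer, 0)
        messages = self.log[start:start + batch]
        self.offsets[consumer] = start + len(messages)
        return messages

    def rewind(self, consumer, index=0):
        # Replay from an earlier point in the retained stream.
        self.offsets[consumer] = index

hose = Firehose()
for i in range(3):
    hose.publish(f"post-{i}")

print(hose.consume("cache-warmer"))  # ['post-0', 'post-1', 'post-2']
print(hose.consume("cache-warmer"))  # [] -- no duplicates
hose.rewind("cache-warmer")
print(hose.consume("cache-warmer"))  # replays from the beginning
```

Because each consumer tracks its own offset, adding a new service that needs the activity stream is just registering a new consumer name; no existing consumer is affected.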

An interesting architecture evolution we are seeing in the cloud is how systems continually reorganize themselves to give components better access to information flows. This allows services to be isolated yet still have access to all the information they need to carry out their specialized function. Before firehose style architectures the easiest path was to create monolithic applications because information was accessible only in one place. Now that information can flow freely and reliably between services, much more sophisticated architectures are possible.

Cell Architectures

Another consequence of Service Oriented Architectures is providing services at scale. The architecture that has evolved to satisfy these requirements is a little known technique called the Cell Architecture.

A Cell Architecture is based on the idea that massive scale requires parallelization and parallelization requires components be isolated from each other. These islands of isolation are called cells. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, typically a range of users, for example.

Cell Architectures have several advantages:

  • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
  • Cells are added in an incremental fashion as more capacity is required.
  • Cells isolate failures. One cell failure does not impact other cells.
  • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
  • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.
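The core routing decision behind the properties above can be sketched in a few lines: each user is deterministically "homed" to one cell. This hash-based version is a simplification of our own; real deployments typically use a lookup service instead of a pure hash so users can be migrated between cells.

```python
# Hedged sketch of cell homing: a user always maps to the same cell,
# so one cell's failure only affects the users homed to it.
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3"]

def home_cell(user_id, cells=CELLS):
    # Stable hash so the same user always lands on the same cell.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return cells[int(digest, 16) % len(cells)]

# Every request for a user is routed to that user's cell.
print(home_cell("alice") == home_cell("alice"))  # True
```

A lookup-table variant trades the simplicity of hashing for the ability to rebalance: when a new cell is added, selected users can be re-homed without reshuffling everyone.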

A number of startups make use of Cell Architectures:

  • Tumblr: Users are mapped into cells and many cells exist per data center. Each cell has an HBase cluster, service cluster, and Redis caching cluster. Users are homed to a cell and all cells consume all posts via firehose updates. Background tasks consume from the firehose to populate tables and process requests. Each cell stores a single copy of all posts.
  • Flickr: Uses a federated approach where all a user’s data is stored on a shard which is a cluster of different services.
  • Facebook: The Messages service has as the basic building block of their system a cluster of machines and services called a cell. A cell consists of ZooKeeper controllers, an application server cluster, and a metadata store.
  • Salesforce: Salesforce is architected in terms of pods. Pods are self-contained sets of functionality consisting of 50 nodes, Oracle RAC servers, and Java application servers. Each pod supports many thousands of customers. If a pod fails only the users on that pod are impacted.

While the internal structure of a cell can be quite complex, the programmability of the cloud makes it relatively easy to configure, start, stop, failover and respond elastically to load.


We are still figuring out the New System of the World for IT. What was strange just a few years ago is now commonplace. Many discoveries and innovations remain to be made; it will never be complete, but the path has been set.

[repost ]The Conspecific Hybrid Cloud


When you’re looking to add new tank mates to an existing aquarium ecosystem, one of the concerns you must have is whether a particular breed of fish is amenable to conspecific cohabitants. Many species are not, which means if you put them together in a confined space, they’re going to fight. Viciously. To the death. Responsible aquarists try to avoid such situations, so careful attention to the conspecificity of animals is a must.

Now, while in many respects the data center ecosystem correlates well to an aquarium ecosystem, in this case it does not. That’s what you usually get today, but it’s not actually the best model. That’s because what you want in the data center ecosystem – particularly when it extends to include public cloud computing resources – is conspecificity in infrastructure.

This desire and practice is being seen both in enterprise data center decision making as well as in startups suddenly dealing with massive growth and increasingly encountering performance bottlenecks over which IT has no control to resolve.


One of the biggest negatives to a hybrid architectural approach to cloud computing is the lack of operational consistency. While enterprise systems may be unified and managed via a common platform, resources and delivery services in the cloud are managed using very different systems and interfaces. This poses a challenge for all of IT, but is particularly an impediment to those responsible for devops – for integrating and automating provisioning of the application delivery services required to support applications. It requires diverse sets of skills – often those peculiar to developers such as programming and standards knowledge (SOAP, XML) – as well as those traditionally found in the data center.

“We own the base, rent the spike. We want a hybrid operation. We love knowing that shock absorber is there.” – Allan Leinwand, Zynga’s Infrastructure CTO

Other bottlenecks were found in the networks to storage systems, Internet traffic moving through Web servers, firewalls’ ability to process the streams of traffic, and load balancers’ ability to keep up with constantly shifting demand.

Zynga uses Citrix Systems CloudStack as its virtual machine management interface superimposed on all zCloud VMs, regardless of whether they’re in the public cloud or private cloud.

— Inside Zynga’s Big Move To Private Cloud by InformationWeek’s Charles Babcock

This operational inconsistency also poses a challenge in the codification of policies across the security, performance, and availability spectrum as diverse systems often require very different methods of encapsulating policies. Amazon security groups are not easily codified in enterprise-class systems, and vice-versa. Similarly, the options available to distribute load across instances required to achieve availability and performance goals are impeded by lack of consistent support for algorithms across load balancing services as well as differences in visibility and health monitoring that prevent a cohesive set of operational policies to govern the overall architecture.

Thus if hybrid cloud is to become the architectural model of choice, it becomes necessary to unify operations across all environments – whether public or enterprise.


We are seeing this demand more and more, as enterprise organizations seek out ways to integrate cloud-based resources into existing architectures to support a variety of business needs – disaster recover, business continuity, and spikes in application demand. What customers are demanding is a unified approach to integrating those resources, which means infrastructure providers must be able to offer solutions that can be deployed both in a traditional enterprise-class model as well as a public cloud environment.

This is also true for organizations that may have started in the cloud but are now moving to a hybrid model in order to seize control of the infrastructure as a means to address performance bottlenecks that simply cannot be addressed by cloud providers due to the innate nature of a shared model.

This ability to invoke and coordinate both private and public clouds is “the hidden jewel” of Zynga’s success, says Allan Leinwand, CTO of infrastructure engineering at the company.

— Lessons From FarmVille: How Zynga Uses The Cloud

While much is made of Zynga’s “reverse cloud-bursting” business model, what seems to be grossly overlooked is the conspecificity of infrastructure required in order to move seamlessly between the two worlds. Whether at the virtualization layer or at the delivery infrastructure layer, a consistent model of operations is a must to transparently take advantage of the business benefits inherent in a cross-environment, aka hybrid, cloud model of deployment.

As organizations converge on a hybrid model, they will continue to recognize the need and advantages of an operationally consistent model – and they are demanding it be supported. Whether it’s Zynga imposing CloudStack on its own infrastructure to maintain compatibility and consistency with its public cloud deployments or enterprise IT requiring public cloud deployable equivalents for traditional enterprise-class solutions, the message is clear: operational consistency is a must when it comes to infrastructure.

[repost ]Strategy: Cache Application Start State To Reduce Spin-Up Times


Using this strategy, Valyala, a commenter on Are Long VM Instance Spin-Up Times In The Cloud Costing You Money?, was able to reduce their GAE application start-up times from 15 seconds down to 1.5 seconds:

Spin-up time for newly added Google AppEngine instances can be reduced using initial state caching. Usually the majority of spin-up time for the newly created GAE instance is spent in the pre-populating of the initial state, which is created from many data pieces loaded from slow data sources such as GAE’s datastore. If the initial state is identical among GAE instances, then the entire state can be serialized and stored in a shared memory (either in the memcache or in the datastore) by the first created instance, so newly created instances could load and quickly unserialize the state from a single blob loaded from shared memory instead of spending a lot of time for creation of the state from multiple data pieces loaded from the datastore.

I reduced spin-up time for new instances of my GAE application from 15 seconds to 1.5 seconds using this technique.

Theoretically the same approach could be used for VM-powered clouds such as Amazon EC2, if the cloud were able to fork() new VMs from a given initial state. Then application developers could boot and pre-configure the required services in a ‘golden’ VM, which would then be stored as a snapshot somewhere in shared memory. The snapshot would be used for fast fork()’ing of new VMs. A VM fork() can be much faster compared to the cold boot of a new VM with the required services.
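The commenter's technique can be sketched as follows. A plain dict stands in for memcache here, and the state loader is a stand-in for many slow datastore reads; on GAE you would use the memcache API with the same check-then-populate pattern. This is a simplified sketch, not the commenter's actual code.

```python
# Sketch of initial-state caching: serialize the expensive-to-build
# startup state once, so later instances deserialize a single blob.
import pickle

fake_memcache = {}   # stand-in for a shared memcache
load_calls = 0       # counts how often the slow path runs

def load_initial_state_slowly():
    # Stand-in for many slow datastore reads at instance start-up.
    global load_calls
    load_calls += 1
    return {"config": {"feature_x": True}, "lookup_table": list(range(5))}

def get_initial_state():
    blob = fake_memcache.get("initial-state")
    if blob is not None:
        return pickle.loads(blob)        # fast path: one blob read
    state = load_initial_state_slowly()  # slow path: first instance only
    fake_memcache["initial-state"] = pickle.dumps(state)
    return state

first = get_initial_state()   # pays the slow load and populates the cache
second = get_initial_state()  # served from the cached blob
print(load_calls)  # 1
```

One caveat the quote implies: this only works when the initial state is identical across instances, and a real implementation needs cache invalidation when the underlying data changes.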

As another commenter noted, GAE now has an Always On feature, which keeps three instances of your app running, but the rub here is you have to pay for the resources you are using. This approach minimizes costs and works across different types of infrastructures.

I’ve successfully used similar approaches for automatically starting, configuring, and initializing in-memory objects across a cluster. In this architecture:

  • Each object has an ID that is mapped to a bag of attributes. Some of those attributes are configuration attributes, some are events, alarms, and dynamic attributes for holding current state.
  • On each node a software system is in charge of figuring out which objects are assigned to which nodes, creating all those objects, and running each object through a startup state machine which includes the object retrieving its state from the database and performing any other required initialization.
  • When all objects have moved to a ready state the node itself would be considered ready for service. The node status was sent to all other nodes which now knew they could use that node for service.

This works great. It minimizes the burden on the application programmer, makes node bring-up fast and easy, and feeds directly into an automatic replication and fail-over system.
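The bring-up scheme above can be sketched as a small state machine per object, with the node declaring itself ready only when every assigned object reaches the ready state. The states and store layout here are invented for illustration.

```python
# Toy sketch of node bring-up: each object walks a startup state
# machine, and the node is ready when all its objects are READY.

INIT, LOADING, READY = "INIT", "LOADING", "READY"

class ManagedObject:
    def __init__(self, object_id, store):
        self.object_id = object_id
        self.store = store       # stand-in for the database
        self.state = INIT
        self.attributes = {}

    def step(self):
        if self.state == INIT:
            self.state = LOADING
        elif self.state == LOADING:
            # Retrieve the object's state from the "database".
            self.attributes = self.store.get(self.object_id, {})
            self.state = READY

def bring_up_node(object_ids, store):
    objects = [ManagedObject(oid, store) for oid in object_ids]
    while not all(o.state == READY for o in objects):
        for o in objects:
            o.step()
    # In the real system this status is broadcast to all other nodes.
    return "NODE_READY"

store = {"obj-1": {"threshold": 5}, "obj-2": {"threshold": 9}}
status = bring_up_node(["obj-1", "obj-2"], store)
print(status)  # NODE_READY
```

Because readiness is a single derived fact over per-object states, peers only need the node-level signal, not the details of each object's initialization.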

Related Articles