Category Archives: Cloud-Common

[repost ]5 Ways To Make Cloud Failure Not An Option

original:http://highscalability.com/blog/2012/12/5/5-ways-to-make-cloud-failure-not-an-option.html

With cloud SLAs generally being worth what you don’t pay for them, what can you do to protect yourself? Sean Hull, in “AirBNB Didn’t Have to Fail,” has some solid advice on how to deal with outages:

  1. Use Redundancy. Make the database and webserver tiers redundant using multi-AZ deployments or, alternatively, read replicas.
  2. Have a browsing-only mode. Give users a read-only version of your site. Users may not even notice failures, as they will only see problems when they need to perform a write operation.
  3. Web Applications need Feature Flags. Build in the ability to turn major parts of your site on and off, and flip the switch when problems arise.
  4. Consider Netflix’s Simian Army. By randomly causing outages in your application you can continually test your failover and redundancy infrastructure.
  5. Use multiple clouds. Use Redundant Arrays of Inexpensive Clouds as a way of surviving outages in any one particular cloud.
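Items 2 and 3 above can be combined: a site-wide read-only flag is just one more feature flag that gates every write path. Here is a minimal illustrative sketch; the flag names, `FLAGS` store, and `handle_post_comment` handler are all invented for the example, not taken from Sean's post.

```python
# Minimal feature-flag sketch with a site-wide browsing-only mode.
# FLAGS, the flag names, and the handler are hypothetical examples.

FLAGS = {
    "checkout": True,    # a major site feature that can be switched off
    "comments": True,
    "read_only": False,  # browsing-only mode for the whole site
}

def is_enabled(feature):
    """A write-path feature is available only if its own flag is on
    and the site is not in read-only mode."""
    return FLAGS.get(feature, False) and not FLAGS["read_only"]

def handle_post_comment(text):
    if not is_enabled("comments"):
        return "503: comments are temporarily disabled, try again later"
    return "comment saved: " + text

# During an outage, ops flips one switch instead of shipping code:
FLAGS["read_only"] = True
print(handle_post_comment("hi"))  # prints the 503 message; reads still work
```

The point of the design is that degrading to read-only is a configuration change, not a deploy, so it can happen in seconds when a backing store fails.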

None of these are easy, and it’s worth considering that your application may not need them at all. Life will almost always go on anyway.

Sean has many more details in “AirBNB Didn’t Have to Fail.”

[repost ]Are We Seeing The Renaissance Of Enterprises In The Cloud?

original:http://highscalability.com/blog/2012/11/5/are-we-seeing-the-renaissance-of-enterprises-in-the-cloud.html

A series of recent surveys on the subject seems to indicate that this is indeed the case:

Research conducted by HP found that the majority of businesses in the EMEA region are planning to move their mission-critical apps to the cloud. Of the 940 respondents, 80 percent revealed plans to move mission-critical apps at some point over the next two to five years.

A more recent survey, by research firm MeriTalk and sponsored by VMware and EMC (NYSE: EMC), showed that one-third of respondents say they plan to move some mission-critical applications to the cloud in the next year. Within two years, the IT managers said they will move 26 percent of their mission-critical apps to the cloud, and in five years, they expect 44 percent of their mission-critical apps to run in the cloud.

The Challenge – How to Bring Hundreds of Enterprise Apps to the Cloud

The reality is that cloud economics only start making sense when there are true workloads that utilize the cloud infrastructure.

If the large majority of your apps fall outside of this category, then you’re not going to benefit much from the cloud. In fact, you’re probably going to lose money, rather than save money.

The Current Approach

  • Focus on building IaaS - The current cloud strategies of many enterprises have been centered on making the infrastructure cloud-ready. This basically means ensuring that they can spawn machines more easily than before. A quick look at many initiatives of this nature shows that only a small portion of enterprise applications actually run on such new systems.
  • Build a new PaaS - PaaS has been touted as the answer for running apps on the cloud. The reality, however, is that most existing PaaS solutions cater only to new apps, and quite often to the small, non-mission-critical share of our enterprise applications, which still leaves the majority of our enterprise workload outside our cloud infrastructure.
  • App Migration as a One-Off Project - The other approach for migrating applications to the cloud has been to select a small group of applications and then migrate these one by one to the cloud. Quite often the thinking behind this approach has been that application migration is a one-off project. The reality is that applications are more like living organisms - things fail, get moved, or need to be added and removed over time. Therefore it’s not enough to move apps to the cloud using some virtualization technique; it’s critical that the way they’re run and maintained also fits the dynamic nature of the cloud.

Why is This Not Going to Work?

Simple math shows that if you apply this model to the rest of your apps, it’s probably going to take years of effort to migrate them all to the cloud. The cost of doing so will be extremely high, not to mention the time-to-market issue, which can be an even greater risk in the end: if migration takes too long, it will reflect on the cost of operations, profit margins, and even the ability to survive in an extremely competitive market.

What’s missing?

What we’re missing is a simple and systematic way to bring all these hundreds and thousands of apps to the cloud.

Moving Enterprise Workloads to the Cloud at a Massive Scale

Instead of thinking of cloud migration as a one-off thing, we need to think of cloud migration on a massive scale.

Thinking in such terms drives a fairly different approach.

In this post, I outlined what I believe should be the main principles for moving enterprise applications at such a scale.

Read full post:http://www.cloudifysource.org/2012/10/30/moving_enterprise_workloads_to_the_cloud_on_a_massive_scale.html

 

[repost ]A Review of the Cloud Computing IaaS Industry Represented by Aliyun and Shanda Cloud

original:http://www.chinacloud.org/%E7%82%B9%E8%AF%84%E9%98%BF%E9%87%8C%E4%BA%91%E7%9B%9B%E5%A4%A7%E4%BA%91%E4%BB%A3%E8%A1%A8%E7%9A%84%E4%BA%91%E8%AE%A1%E7%AE%97iaas%E4%BA%A7%E4%B8%9A/

 

I really dislike such a long article title. I worry that some readers can’t get through it in one breath, and that those weak in Chinese will struggle to parse it. But for the sake of search engine indexing, and so that people who see the title might be interested enough to read on, I had to use this verbose title. The title I wanted was “A Review and Forecast of China’s IaaS Industry,” but I worried that some people (possibly quite a few) don’t know what IaaS stands for, and worried even more that some people know of no Chinese cloud computing services besides Aliyun and Shanda Cloud, or don’t believe any others exist. So, for various reasons both speakable and unspeakable, the title ended up long.

A word on scope: this article does not discuss SaaS, concepts, or whether something is “real” or “fake” cloud; it discusses only IaaS. PaaS would normally deserve a mention in passing, but PaaS in China is genuinely negligible; depending on mood and time, it may come up below. Finally, definitions of IaaS and PaaS are not discussed here; see other articles.

 

1 IaaS in Retrospect

1.1 How Long Has Cloud Computing Existed in China?

It is now late August/early September 2012. How long has China’s IaaS been around? Two years? Three? No: roughly four and a half years. There are always companies claiming, publicly or privately, to have done cloud computing for five or six years, and headhunters keep telling me they are looking for senior cloud computing veterans. Fine: if these people mean SaaS, I’ll grant it. But in fact they all mean cloud computing, specifically IaaS, and I have to say: no such thing. To claim that people started before the terms “cloud computing” and “IaaS” even existed or entered China stretches things much too far. In fact, the term “Cloud Computing” did not exist in English before 2006. Around 2006, it began to appear occasionally. By late 2007, its frequency rose rapidly. In early 2008, “Cloud Computing” began to be translated into Chinese as “云计算.” In the first half of 2008, fewer than ten people in China understood what the term meant.

1.2 Who Started Cloud Computing and IaaS First?

Which company or which people were the first in China to do cloud computing? (SaaS predating the cloud computing concept is excluded.) Aliyun? Shanda Cloud? Absolutely not; they don’t even make the shortlist. Aliyun was founded in 2009, and at its founding it had not started doing anything; it was merely Alisoft renamed. Shanda Cloud started recruiting in early 2010, when the company was also established. Both launched services only in 2011. Was anyone earlier? Yes: 21Vianet (whose cloud unit was later spun off as CloudEx, 云快线) was the first company to practice IaaS. If we narrow it to virtualization technology, the earliest exposure belongs to some of IBM’s and Intel’s R&D staff in China. 21Vianet began exploring IaaS in early 2008 and coined the now-universal term “cloud host” (云主机); it launched a cloud host beta in early 2009, reorganized as CloudEx at the end of 2009, and released Cloud Host 2.0 at the end of 2010. Where is CloudEx now? Disbanded in September 2011. 21Vianet is a low-profile player in the IDC space, but CloudEx’s influence can be glimpsed in where its people went after the dissolution and in the graphics and copy describing cloud hosts on today’s major cloud host websites (every IDC company doing cloud hosts borrowed considerably from CloudEx’s website and slides).

1.2.1 Who Else Joined In?

CloudEx got up early but missed the market all the same. Then, as the saying goes, the waves of the Yangtze pushed on and the front wave died on the beach. The IaaS services launched by Aliyun and Shanda Cloud in early 2011 quickly caught the attention of major media, developers, and small webmasters, and pushed cloud computing to the peak of its influence in China.

In the second half of 2011, Shanghai 21Vianet launched a cloud host based on third-party technology and platforms. In 2012, Hangzhou’s LinkCloud, West.cn, and Pacific Telecom launched cloud hosts in succession; today every major IDC provider has a cloud host on sale.

How many IDC providers have launched cloud hosts? There is no precise figure, because there are too many small and regional providers. Judging from Baidu and Google, nearly 20 have bought paid “cloud host” keywords, and no fewer than 50 others feature cloud hosts in their titles and offerings.

Incidentally, IaaS started abroad in 2006; China started two years behind, but today the industry as a whole lags by at least three years. Why? I have discussed it elsewhere; a real analysis would be too long and complex and would stray from this article’s topic.

One more note: some players are about to enter, or preparing to enter, IaaS; they are discussed in the forecast section. Here are two predictions from 2010. So far the market-size forecast is basically correct, though the 2013 figure cannot yet be confirmed. The list of companies expected to enter IaaS in 2011 was too optimistic; nearly half have yet to release a product.

 

 

 

2 The Current State of IaaS

As noted above, dozens of companies large and small are already in the IaaS business. Setting aside who strictly qualifies as cloud computing or IaaS by definition, let’s count anyone claiming a cloud host business, and sort these companies into tiers.

2.1 The First Tier: Aliyun, Shanda Cloud, and HiChina (Wanwang)

Few would dispute placing Aliyun, Shanda Cloud, and HiChina in the first tier. These three are not only the best-known IaaS vendors but also the ones with the greatest market influence and customer base. They share another trait: before entering IaaS, each was already a well-known company, and their IaaS depends heavily on their parent companies and other businesses. They differ too; if forced to rank them, Aliyun comes first, Shanda Cloud second, HiChina third. Being in the first tier does not necessarily mean they are doing well; on the contrary, all are failing grades, including first-place Aliyun, because I am very, very sorry to observe that Aliyun’s effect on the development of the IaaS industry so far has been more negative than positive.

2.1.1 Aliyun

Aliyun first. Aliyun ranks first not because it does well, but because relative to the other vendors it has not done worse; call it the tallest among dwarfs. Many in the industry are surprised by this assessment, and Aliyun people will scoff: “What? Aliyun has the most customers! Our platform has been rebuilt several times and is the most advanced! Aliyun’s technical strength and bandwidth quality, plus HiChina’s experience, are absolutely first in China and world-class!”

Aliyun’s biggest and most important advantages are these:

Enormous financial resources. Money obviously matters for IaaS. Everyone knows hosting requires little capital: rent a data center and resell it. Building a data center, though, is very expensive (of course not every IDC builds its own). IaaS does not require the one-time outlay a data center does, but since the spending goes into people and equipment, its early investment is mostly in “soft” assets rather than hard ones. Data center investment is easy to understand even for people outside IT, because the money turns into buildings and cooling equipment depreciated over many years. People and equipment are different: staff costs are expenses, gone once spent, and equipment depreciates over two or three years. IaaS needs somewhat more capital than hosting to get started, because it usually requires some development work and some equipment of one’s own. Aliyun’s funding comes from one of China’s largest Internet companies; what would be a huge investment for many other companies is a drizzle for Alibaba. That is not to say you need many people and a huge sum to start. But having money is unquestionably an advantage: you can afford ample engineers, wait for the platform to mature, and wait longer for profitability. Technical strength follows from financial strength, so it needs no separate discussion.

Market recognition and influence. Leveraging Alibaba’s name among small webmasters and e-commerce site owners, Aliyun quickly gathered a batch of customers. As a top Chinese Internet company, anything it does draws industry attention, making the early market easier to open. Look at how Fanfou fared after Sina Weibo launched and you will see that a big company’s influence matters no less than its money.

BGP bandwidth. The BGP network that Alibaba Group obtained over recent years, in the name of securing e-commerce, is a semi-monopoly advantage; Alibaba is one of the few top Chinese Internet companies to have recognized the importance of BGP networks and acted on it. Other top Internet companies of equal capability pale here, and the vast majority of IDC companies lack this hard asset as well.

Infrastructure operations experience. As a top domestic Internet company, and an e-commerce one at that, Alibaba has accumulated considerable experience in data center and network management, which is indispensable for IaaS.

The HiChina acquisition. What the acquisition brought Aliyun was not customers or IDC operating experience but, chiefly, a solution to the license issue: customers could now complete ICP filing (备案). Otherwise Aliyun would have followed in Shanda’s footsteps, with troublesome filing services.

With all these conditions and advantages, there would be no excuse for not becoming a top IaaS provider. Calling Aliyun an IaaS provider will surely displease Alibaba’s executives, because their design was at least a PaaS provider, with IaaS as a mere foundation. At one point they even considered abandoning IaaS to bet on Aliyun OS and phones, but reality’s hammer smashed that plan: Aliyun OS and applications such as search and the input method bring essentially no revenue. Within IaaS, the cloud host alone dominates: it supports the vast majority of Aliyun’s income. So, taking the cloud host as representative of IaaS, here is why Aliyun’s faults outweigh its merits:

It has not built, and seems not to intend to build, a win-win ecosystem. With Alibaba’s fame, hundreds of millions in state subsidies, and the advantages above, Aliyun gained abundant resources and a large customer base. But to this day it has not opened product APIs, nor built an ecosystem in which upstream and downstream vendors and partners can also win; so far there is no sign of even an intention to. On the contrary, its product system is relatively closed: from IaaS to PaaS to applications, everything, security products included, is developed and provided in-house, leaving partners no benefit. Aliyun OS and the other applications all aim to keep customer traffic and data inside Aliyun; Baidu and Tencent, of course, are doing the same.

Its product model does not demonstrate cloud computing’s characteristics and advantages, nor bring broader benefit to traditional IDC customers. At present, Aliyun’s main customer-acquisition levers are BGP bandwidth and price, both resource-driven. It has attracted many of traditional IDC’s virtual host, VPS, and small colocation customers, but the direction of the value it delivers does not surpass traditional IDC providers. This is why I initially favored Shanda Cloud, whose product model at the time better reflected the cloud business model; Aliyun’s early product model more closely resembled traditional IDC hosting, with insufficiently flexible billing.

Performance problems keep the target customers from gaining confidence in IaaS. As noted, Aliyun’s product model largely carried over the traditional IDC model, at least in the first two years; it now runs exploratory polls on customer needs. But I believe a leading vendor should not only heed existing customer needs but also uncover latent needs and promote innovative ways to meet them, and I have seen no such intent or action from Aliyun. Another important point: because Aliyun was long technology-led, and perfectionist-led at that, it lacked understanding of and service experience with traditional IDC and small webmasters, and its cloud host performance problems remain unsolved, chiefly disk I/O. Despite expensive SAS disks with guaranteed IOPS, I/O throughput is trending downward and basically cannot satisfy medium-scale applications. When the leading IaaS vendor cannot make the product basically usable for its target users, that is a great drag on the cloud computing and IaaS industry: it keeps cloud hosts a playground for individual webmasters, and the whole cloud host industry fails to attract mid-sized customers.

Strategic vacillation. Although Aliyun’s IaaS service has never been interrupted, since its founding in 2009 Alibaba Group’s strategic positioning for Aliyun has changed several times, and not once was it positioned on IaaS, because big companies all feel IaaS is too basic, too junior, too tasteless; they won’t admit it aloud, but they all think it. First the positioning was PaaS, then Aliyun OS and mobile Internet, now a data platform. These positions all fit Alibaba Group’s strategy; none fits an IaaS strategy. As an Internet trade intermediary, Alibaba does need a PaaS to expand beyond e-commerce, needs Aliyun OS to seize the mobile Internet entry point, and needs its vast transaction data to grow and generate greater value. Hence the current slogan, “building the number one data sharing platform.” But how many of Aliyun’s products have anything to do with data sharing? When IaaS profits are thin and the group needs Aliyun for other roles, whether basic products like the cloud host will keep being strengthened is in doubt, just like Baidu Youa back then: doing it or not hardly mattered; success would be icing, and failure just meant switching industries, say to video. Most recently, the cloud OS was split out of Aliyun and the group reaffirmed its support and commitment to the OS with US$200 million in funding; nothing was said about funding Aliyun, which means Aliyun will face fairly urgent revenue-growth and profitability pressure.

The filing (备案) service is murky. Aliyun itself has no IDC operating license; fortunately the group’s HiChina provides filing. Proxy filing is common in the IDC industry. But among the dozen-plus products Aliyun has released so far, Aliyun OS and phones included, only the cloud host shows traction and revenue, and the cloud host is a natural extension of the IDC business, which makes the Aliyun-HiChina relationship extremely delicate; the filing policy has been adjusted repeatedly, from essentially no filing, to a 200-yuan filing fee, to promotions waiving that fee, to rumors of a third-party filing provider, to the current coexistence of rumors of HiChina’s independent IPO and of a HiChina-Aliyun merger. Outsiders cannot know the games and variables inside. Filing is a foundational service for IDC and cloud hosts, customarily free, and it needs to stabilize. Behind this lies the need for a clear division of positioning between Aliyun and HiChina. I suspect this is also the unspoken difficulty behind Aliyun’s “number one data sharing platform” banner: Alibaba Group cannot contain two companies whose main business is IDC, cloud hosts, and IaaS.

 

2.1.2 Shanda Cloud

At its founding, Shanda Cloud was the most favored; its product model genuinely felt like the essence of cloud computing. At the time, Aliyun had merely copied HiChina’s interface and product model for selling server colocation and rentals; I suspect its product lead came from HiChina. The situation is now completely different: problems outnumber advantages. Its problems with BGP bandwidth, its own data centers, lack of IDC experience, and lack of licenses have seriously damaged customer confidence. It still ranks second mainly in recognition of the IaaS product-model innovation it brought in its early days.

Shanda Cloud’s early advantages were mainly these:

Enormous financial resources. It likewise built a capable technical team early on, brought in talent from abroad, and likewise received over a hundred million yuan in government funding.

Market recognition and influence. As a company once owned by China’s richest man, Shanda retains enormous name recognition and influence even though it has now dropped out of the first rank of Internet and gaming companies. It, too, attracted a large batch of users early on.

Shanda Cloud’s current disadvantages are more numerous:

Data center and bandwidth problems. Shanda Cloud has long rented data centers and used third-party CDN services. This does reduce operating costs, but for an IaaS business it becomes a huge disadvantage: no self-operated data centers, only rented ones, and no high-quality multi-line or BGP bandwidth.

Infrastructure operations experience. Likewise, heavy infrastructure outsourcing has left it short on in-house data center and network management experience.

IDC service experience. Unlike Aliyun, which gained some understanding of IDC services after acquiring HiChina, Shanda Cloud stands even further from IDC, and its talent hired from abroad was initially unfamiliar with, or unadapted to, domestic IDC policy and conditions, so its early filing and after-sales service drew widespread customer complaints.

Insufficient cloud host performance and stability. Especially in the first year of its East China node, cloud host stability was lacking, network quality fluctuated frequently, and disk I/O was so poor that even running small and medium websites was problematic, so users fled in droves and lost confidence. The later North China node improved on all these fronts, but regaining lost confidence is ten times harder than building it the first time. This reflects a problem at the Shanda group level: Shanda and Mr. Chen Tianqiao usually tolerate a new business for no more than a year, and Shanda Cloud’s continued funding probably owes mainly to government support. Shanda Cloud was founded and started slightly later than Aliyun yet launched products earlier, which shows that the internal schedule pressure was extreme, beyond what is reasonable. IaaS is clearly different from online games: the preparation period demanded by its technology, resources, and service system is certainly longer than the cycle for launching a game inside a mature game company. A case of more haste, less speed.

2.1.3 HiChina

HiChina has long been in the leading tier of China’s IDC industry, probably ranking in the top two in every IDC business except colocation. In cloud hosts it paid attention early but moved slowly, in keeping with its familiarity with traditional IDC and its prudent corporate style. Had it not joined the Alibaba camp, it could not have its current position in cloud hosts, because its R&D investment is extremely cautious and it has no BGP bandwidth of its own; it would likely be only slightly better off than West.cn. Now, instead, it is poised to overtake Shanda and stand level with Aliyun.

HiChina’s advantages:

A solid IDC leadership position and customer base. HiChina’s domain business is second to none in China and its other services also rank near the top, with a large base of small and medium customers. Its influence in the IDC industry rivals Alibaba’s in e-commerce.

Alibaba’s BGP bandwidth. Needless to say, bandwidth quality and IP quantity are no longer worries.

Aliyun’s cloud platform. HiChina’s underlying cloud platform is Aliyun’s. It may not shine, but it at least minimizes investment risk, which suits HiChina’s style. Aliyun’s platform, while short of my expectations, is basically usable for small and medium users, and Aliyun keeps improving it.

As for HiChina’s disadvantages, there really aren’t many; one is tied to its advantage:

Understanding of and penetration into mid-sized and large customers. HiChina’s IDC standing rests on its influence among individuals and small customers, while mid-sized and large customers are mostly in the hands of 21Vianet and others. The cloud host is a product that replaces colocation and server rental, not a virtual-host/VPS-class product. This mismatched extension may skew HiChina’s understanding and positioning of cloud hosts and other IaaS products. Even if it clears that hurdle, whether its talent and corporate strategy can adapt to this customer segment remains to be seen.

Coordination and differentiation with Aliyun. Aliyun’s revenue currently depends mainly on cloud hosts, which is also its only product gaining users despite more than ten launched products. So whatever Aliyun claims to be, “number one data sharing platform” or otherwise, relying on and developing IaaS is a choice it cannot avoid, or it loses its reason to exist within the group. Meanwhile the natural kinship between cloud hosts and traditional IDC makes it unlikely HiChina will completely abandon cloud hosts and IaaS. Just as the Taobao cloud team once competed with the Aliyun team, the competition between HiChina’s IaaS and Alibaba’s IaaS is a problem that must be resolved, and of course a risk; only one of the two can win the group’s backing.

2.2 The Second Tier: LinkCloud, West.cn, and HuaYun

To be clear first, the second tier does not mean these companies lack results or have poor products. Their main handicaps are: 1) a weak inherited base; 2) fewer customers so far than the first tier. The first factor is decisive. Analyzing these IaaS vendors’ backing, investment, inheritable customers, technology, and influence, product value for money, and customers won, I find the second tier’s return on investment currently exceeds the first tier’s. The real reason they sit in the second tier is the weak inherited base: without huge capital, brand recognition, and an existing customer base as a foundation, even effective investment and relatively good output leave their overall strength and influence second-line in the market. The first/second tier ranking closely matches Baidu’s organic search rankings for “cloud host.”

The second tier’s greatest common trait is that each has some R&D capability, hence a foundation for, and the possibility of, long-term development. That R&D force may be one engineer or a dozen-plus, but either way it counts as in-house development capability.

2.2.1 LinkCloud

LinkCloud has the strongest momentum in the second tier, exerting real pressure on Shanda Cloud and HiChina in the first tier; its value for money and word of mouth currently beat both. Comparatively, Shanda Cloud is on the way down, while HiChina looks steady but is actually rather short-termist about cloud computing and mainly converts existing customers. Studying LinkCloud and Aliyun, both in the same city, I found that users choosing LinkCloud know clearly why they chose it and what they value, while users choosing Alibaba only vaguely say that Alibaba and Aliyun are big companies so the product must be good. The inference is that Aliyun, like early Shanda Cloud, mainly trades on corporate reputation and fame, plus doing slightly better than Shanda Cloud. That is as dangerous as it was for Shanda Cloud: if a famous company’s product has no distinctive strengths and the problems noted above are not solved soon, one or two small breaches can trigger continuous, irrecoverable customer loss.

The second tier of LinkCloud, West.cn, and HuaYun is really just these three. Market recognition for cloud hosts follows the same order: after the first tier come the LinkCloud cloud host, the West.cn cloud host, then HuaYun, whose influence trails the other two considerably. LinkCloud deserves the most attention, not only for the visibility its marketing has earned, but more because its R&D strength leads the second tier by a wide margin, exceeding HiChina and approaching Shanda Cloud, and because it leads on value for money and performance.

LinkCloud’s strengths are R&D and innovation: flexible billing including hourly billing, traffic-based billing, and trial without application, features that even Shanda Cloud, HiChina Cloud, and Aliyun are still only researching and planning. Yet for all its momentum, LinkCloud’s biggest soft spot is the thinness of its backing: its parent, WY Interconnection (网银互联), a first-line Zhejiang and second-line national IDC vendor, has limited annual profits, and given Hangzhou’s business environment, whether it can keep investing in and supporting LinkCloud until LinkCloud turns profitable is an open question.

2.2.2 West.cn

West.cn relies mainly on its traditional IDC business and reseller influence, but the claim in its May and June advertorials that the West.cn reseller system was the first platform in China to provision cloud hosts in real time is laughable: without real-time provisioning it isn’t a cloud host at all, and other vendors in the first and second tiers achieved it long ago. LinkCloud’s technical choices, such as building on KVM rather than adopting an off-the-shelf open-source virtualization management platform, and product choices, such as a stable and transparent price system, keep it far ahead within the second tier and give it clear distinguishing features even against the first tier.

Though West.cn advances slowly, it is the same type as HiChina Cloud: steady, incremental progress, determined by their entrenched advantage in traditional IDC. It will gradually expand to a certain market share, the size of which depends on its investment, technology, and other strengths.

2.2.3 HuaYun

HuaYun carries the most uncertainty here. At present it invests heavily in R&D and products, having launched a cloud host, cloud storage, and a cloud CDN, but it has yet to make a mark in marketing or reputation. Its predecessor, LanMang (蓝芒科技), was also a second-line national IDC vendor, but given its background in IDC industry software development and its all-in pivot to cloud computing, the future variables and potential industry impact leave considerable room for imagination.

 

2.3 The Third Tier: ViaCloud and Pacific Telecom

In the third tier, as in the second, only two players are far ahead: Shanghai 21Vianet’s ViaCloud cloud host and Pacific Telecom. In name recognition they are no worse than the second tier’s HuaYun, perhaps better; but that recognition rests mainly on Baidu paid ads, on which they outspend HuaYun. Were they to invest similarly in R&D, they could enter the second tier. But both Shanghai 21Vianet’s ViaCloud and Pacific Telecom’s ZhenYun chose a third party’s ready-made complete platform, which is the main reason they are placed in the third tier; that platform currently has very few customers and an uncertain future. Compared with the rest of the third tier, which provisions cloud hosts manually, and with vendors that spend less on advertising, ViaCloud and Pacific Telecom can still be called far ahead. This tier includes every remaining vendor claiming cloud hosts; aside from ViaCloud, most presumably still handle customer requests manually offline, relying on existing IDC resources and direct manual use of some open-source software.

Beyond 21Vianet, traditional IDC vendors such as ChinaCache and ChinaNetCenter are also watching cloud computing; ChinaCache has acted. It initially partnered with Joyent to offer cloud hosts, then after Joyent’s departure focused on private-cloud hosting, and now offers cloud host and cloud storage services. But since IaaS remains an extension of IDC, its products show no influence beyond its existing customer base. Its understanding, positioning, and strategy for cloud computing and cloud hosts may simply differ from other vendors’.

Dozens of other IDC providers, such as HuYi China and YunPai, offer cloud host services; and RuiHao Open Source has been doing Xen VPS for a long time.

 

2.4 The Fourth Tier: Huawei and Company

The first three tiers vary wildly: some doing well, some not, some rising, some falling, some trading on the influence of traditional businesses, some newly risen. But they share one thing: they have clearly entered the IaaS competition; they have shown their cards. There is another, special tier, broad in scope and not small in strength, whose only common trait is that they have not yet played a card: some intend to enter the market; for others there is only a possibility, with no clear intent. This tier comprises companies preparing, or likely, to enter IaaS, certainly more numerous than those already in. For example:

2.4.1 Huawei

Huawei ranks first because it is the likeliest to enter. Why the likeliest? Because it has had a formal business team operating for a year and a public IaaS website already running; it simply has not opened the service fully. Reportedly it covers four products: cloud host, cloud storage, cloud desktop, and cloud conferencing. The team numbers about 200, currently organized under the enterprise business product line.

2.4.2 China Telecom

China Telecom has, in fact, already listed cloud host offerings on some regional websites; you cannot buy online, you must phone in, and it does not appear to have been promoted. The latest word is that Telecom’s cloud host business will formally launch in the first half of next year. Telecom’s internal cloud computing standards work began back in 2010; the preparation could hardly be more thorough. But IaaS is not, in the end, a matter of technology and standards; when its cloud host will actually launch, and how competitive it will be, does not look promising.

2.4.3 Baidu

Baidu’s boss, Mr. Robin Li, delivered the most famous cloud computing verdict of recent years: “cloud computing is old wine in a new bottle,” apparently aimed at Jack Ma sitting beside him. This year is different: Baidu wants to build a cloud computing platform, and Robin Li has personally taken the stage several times to advertise cloud computing. I won’t repeat the copy, but the gist is “Baidu cloud computing is new wine in a new bottle.” Mr. Li is said to have a technical background, but his marketing instincts are evidently not shallow. Still, of the cloud Baidu recently launched, the genuinely usable part is a network drive; the other usable services belong to the open platform, designed for webmasters to integrate Baidu applications. In the first half of the year I saw public documentation for Baidu’s cloud host, but there has been no movement lately; whether it hit difficulties or is quietly preparing, I don’t know. Its product list does explicitly include a virtual machine product, so I suspect BAE’s progress has not been smooth either. Its PaaS lineup of BAE, cloud testing, LBS, and other developer services, plus operations, promotion, and monetization services, completely outstrips Sina SAE, Shanda Cloud Engine, and Alibaba’s ACE in product completeness, yet its influence trails even first-mover SAE, never mind the IaaS services of Aliyun and Shanda Cloud. PaaS has congenital defects; see below.

2.4.4 Tencent

Tencent usually waits and watches until others reach some scale, then quietly launches and quickly seizes a hilltop; occasionally, of course, it stumbles. Visibly, Tencent is consolidating its IDC resources, BGP network construction should already be on the agenda, an internal IaaS-like service platform started a year ago, and some services may already be in internal use. But there is still no sign of it offering IaaS-type services; most likely it will first flesh out its open platform, or launch a PaaS service.

2.4.5 Others

Beyond those named above:

  • NetEase

NetEase also has a very small internal research team, around ten people, reportedly intending to enter the IaaS market.

  • JD

JD has a network drive service and is trying private-cloud solutions internally; a public cloud should not launch in the short term.

  • Perfect World

Perfect World is reportedly also interested in private cloud.

  • 360

360 should likewise have an internal private-cloud research team of a few people.

  • The three major carriers

The three major carriers, Telecom, Unicom, and Mobile, each have a sizable cloud computing research and standards team, oriented mainly toward internal private clouds while preparing a large public-cloud service platform spanning IaaS, PaaS, and SaaS. At present, Telecom at least has established an independent cloud computing business unit or subsidiary.

  • Kingsoft

Kingsoft’s Kingsoft Cloud currently focuses on Kuaipan, its network drive, and reportedly plans to enter enterprise storage services, perhaps backup or cloud storage.

  • ezCloud

An IaaS service company co-founded by former 21Vianet president Lei Zidong and Jiang Qingye, offering cloud hosts, cloud security, and cloud storage.

  • UCloud

Founded by former Shanda Cloud co-CEO Ji Xinhua; currently offers cloud hosts, cloud disks, CDN, and more.

 

3 PaaS

At home or abroad, PaaS is not a mainstream cloud computing segment; in China it hardly counts as an industry at all. I don’t know whether PaaS is inherently closed and proprietary, but every PaaS service I have seen so far is closed and proprietary, far more so than IaaS. Even if a PaaS is built entirely on open-source technology, or the platform itself is open-sourced, that does not change its proprietary nature, because it requires developers and operators to adopt development and deployment environments utterly unlike their current ones, with almost no portability.

Abroad, the most distinctive offering is Joyent’s. Its platform is built entirely on open-source technology and provides an elegant development and deployment environment. Alas, the misfortune of its Solaris lineage, and the fact that Ruby and Python have not won a larger share of developers, cap the imaginable scope of its growth. Google App Engine and Microsoft Azure are certainly well known, but GAE’s acceptance has stayed low, and Azure’s addition of IaaS features and of open-source components from outside the Microsoft stack shows their current impotence.

Large enterprises are fond of PaaS because it can build a closed ecosystem, more likely to retain traffic and web applications. These companies are also overconfident in the value of their existing platforms and in their influence over developers.

At home, whether it is the long-launched Sina SAE or Aliyun’s ACE and Shanda’s Cloud Engine, the customers won and the market influence and recognition earned are negligible next to IaaS services. Until SAE’s web servers become more independent and Baidu launches virtual machines, these PaaS offerings will struggle to establish a real position, even though Sina and Baidu throw all their resources behind webmasters building applications on them. Still, I welcome attempts at such new products; innovation and revolution sometimes hide precisely in technologies and models the mainstream does not recognize.

4 IaaS Forecast

In 2010 there were some predictions, mainly of IaaS market size and participants; the size was basically right, while the participants, given the market’s complexity, were partly right and partly wrong. Most of this article is strongly subjective and represents personal views. Predictions, if not quite rumors, are the subjective within the subjective; treat them as entertainment. The main ones:

  • PaaS will not become mainstream within 5 years.

That is, PaaS’s own customer count and revenue will be a rounding error next to traditional IDC and IaaS, its influence will be confined to technology enthusiasts, and the independent websites built on it that rank within the global top 100,000 will be countable on one hand.

  • Pure PaaS services will evolve into IaaS/PaaS hybrids.

Pure PaaS services depend on corporate transfusions in their early days; after failing to attract developers and build a closed ecosystem, the need for sustained growth and revenue will push them into the IaaS business. This should be a latent trend; Aliyun and Shanda Cloud have likewise launched PaaS services.

  • Aliyun will face assault and erosion from the IDC crowd in the first half of 2013.

Aliyun’s current rise will continue for another six months to a year, then be eroded by IDC companies entering in succession from the first half of 2012 onward. IDC companies can find ways to offer cloud computing through in-house development or partnership, because IaaS is the natural extension of the IDC business. Aliyun may stop rising in the second half of 2013 and may begin to decline, but it will still hold a solid market share within five years. This conclusion has a premise: that when growth slows and stalls, Aliyun still receives the group’s full support, including bandwidth, pre-profit funding, and coordination of HiChina’s licenses. Without that premise, within three years it will swiftly face a dissolution crisis like Shanda’s, which is the next prediction.

  • Aliyun may face merger or dissolution within three years.

Given that Alibaba Group cannot sustain two companies whose main business is IaaS, and will not allow Aliyun to become the R&D and technical-support company HiChina would like it to be, Aliyun has only one road: not the professed “number one data sharing platform,” but an ecosystem-building IaaS/PaaS platform that turns profitable quickly. This is no longer a question of the group tolerating a few years of Aliyun losses but of HiChina facing a survival crisis: an Aliyun focused on IaaS/PaaS is an absolute pressure and threat to HiChina, and HiChina will press Aliyun, at both the group and market levels, either to specialize in technology or to pivot. Aliyun’s current good run merely pressures HiChina without shaking its position in IDC. Without HiChina’s support, no solution to Aliyun’s licensing problem is in sight. If the previous prediction comes true and Aliyun’s scale and revenue stop growing while HiChina succeeds in the market, and given that Aliyun and the cloud OS have just split, with US$200 million pledged to the OS and nothing said of Aliyun, HiChina swallowing Aliyun is not impossible. HiChina’s attitude will decide whether Aliyun is merged or dissolved. Aliyun’s relentless marketing can perhaps be read as the outward extension of this anxiety.

  • HiChina may absorb Aliyun within 3 years.

This is a corollary of the previous prediction: HiChina lists independently, Aliyun is merged into HiChina, Alibaba’s investment in Aliyun gets a chance of redemption, the staff can be placed, and the ending beats Alisoft’s. Like Alisoft, Aliyun is a technologically and operationally independent unit within the group; the other businesses have their own technical teams and architectures, including at every layer of the cloud. Given HiChina’s caution in R&D commitments, Alibaba Group would have to offer additional terms for this difficult deal to close.

  • Shanda Cloud risks dissolution in the second half of 2013.

Shanda Cloud may be striving internally, but at the customer level new-user growth has nearly stalled, so its momentum can be judged to have turned downward. Shanda as a group usually tolerates a new business for one to two years, but Shanda Cloud has received large state and Shanghai subsidies, so three years should be no problem. Counting from 2011, the second half of 2013 through the first half of 2014 is its critical period; if it cannot reach profitability itself and government funding stops, it will face dissolution.

  • Huawei’s cloud service may go live within 3 months.

Huawei’s cloud service website is in fact already reachable; the business simply isn’t open yet, nor has it been promoted.

  • NetEase’s cloud may die before launch.

If NetEase’s cloud aims at IaaS, then judging from the company’s current support and investment, even launching is doubtful.

  • Huawei Cloud, Baidu Cloud, Tencent Cloud, and NetEase Cloud will be flashes in the pan.

Huawei Cloud starts from IaaS, perhaps carrying SaaS along; Baidu Cloud and Tencent Cloud cut in from PaaS, perhaps carrying IaaS along. The enormous bulk behind them does not give them a natural advantage in IaaS. If network interconnection and IDC market access do not change fundamentally, they will struggle to break through on bandwidth and licenses; IDC and IaaS margins are low; and if PaaS fails to deliver the hoped-for strategic results, then the PaaS business may persist as part of their open platforms while the IaaS business is marginalized or disappears.

  • IDC vendors will accelerate into IaaS in 2013.

Although dozens of IDC vendors are already in, or entering, the cloud host business, 2013 will be the year IDC vendors accelerate into IaaS and PaaS, with the stronger ones joining through in-house development, partnership, or purchase.

  • Carrier public-cloud IaaS may gestate for another year.

Even for Telecom, the fastest-moving of the three carriers, and despite various channels whispering of a cloud host launch in the first half of 2013, I personally believe its self-developed IaaS service will remain in gestation for the coming year.

  • 21Vianet may partner with Microsoft on a public cloud.

Since CloudEx’s failure, 21Vianet has essentially lost the ability to launch a cloud service independently. Amazon AWS leans too heavily toward IaaS, and Amazon and 21Vianet have had business entanglements, so the likeliest outcome remains a partnership with Microsoft to launch an Azure service.

Original article by Hantang Yue (汉唐月).

Notice: any repost must reproduce the full text, credit the source, and retain this notice; quotations must credit the source; this article refuses reposting and commentary by the likes of “Dr. SaaS”; reposting and re-editing for any non-commercial purpose are welcome; for any commercial use, contact the author.

[repost ]Virtualization, Cloud Computing, Open Source, and More

original:http://www.qyjohn.net/?p=1552

I took the National Day holiday to write this long piece, a full organization of my personal views on every layer from virtualization up through cloud computing. It covers virtualization, virtualization management, data center virtualization, cloud computing, public versus private clouds, and open source. Everything here is my personal opinion and represents no company’s views. Discussion welcome.

A. Virtualization

Virtualization is the ability to simulate multiple virtual machines on a single physical machine. Each virtual machine logically owns an independent processor, memory, disk, and network interface. Virtualization raises hardware utilization, letting multiple applications run on one physical machine, each with its own isolated runtime environment.

Virtualization exists at different levels, for example hardware-level and software-level. Hardware virtualization obtains an environment resembling a real computer by simulating hardware, capable of running a complete operating system. Within hardware virtualization there are several approaches: Full Virtualization (an almost complete simulation of a real hardware set; most operating systems run in a fully virtualized environment without any modification), Partial Virtualization (simulation of only key computing components or instruction sets; the operating system may need some modification to run), and Paravirtualization (no hardware simulation; virtual machines have independent runtime environments and share the underlying hardware through a hypervisor; most operating systems need modification to run). Software-level virtualization typically provides multiple isolated virtual runtime environments on top of a single operating system instance, and is often called container technology.

At the hardware-virtualization level, modern virtualization technology is usually a hybrid of full virtualization and paravirtualization; common technologies such as VMware, Xen, and KVM all support both. Virtual machines provided via hardware virtualization usually each run a complete operating system, so a single host carries large numbers of identical or similar processes and memory pages, causing noticeable performance loss. Techniques such as KSM can identify and merge memory pages with identical contents, but there is still no effective way to optimize across large numbers of identical or similar processes. Hardware virtualization is therefore often called heavyweight virtualization: the number of virtual machines that can run concurrently on one host is quite limited. At the software-virtualization level, all virtual environments on a host share one operating system instance, so there is no loss from running multiple OS instances. Software virtualization is therefore often called lightweight virtualization, and the number of virtual environments that can run concurrently on one host is far more generous. Take Containers on the Solaris operating system: one Solaris instance can in theory support up to 8,000 Containers (the practical number depends on system resources and load). Similarly, LXC on Linux can easily support a considerable number of virtual runtime environments on one host.

In the virtualization field, Chinese companies are more interested in hardware virtualization and mostly adopt it in development and production. Taobao was among the earliest in China to research and apply software virtualization; its experience on the main Taobao site showed that replacing Xen with cgroups improved resource utilization. As for whether a given scenario should use hardware or software virtualization, the key consideration is whether end users need full control of the operating system (for example, upgrading the kernel). If end users only need control of the runtime environment (as with the various App Engine services at the PaaS layer), software virtualization may offer better value for money. For scenarios that provide horizontal scaling for a single application, software virtualization is also a good choice.

For engineers who need a deeper understanding of virtualization, VMware’s white paper “Understanding Full Virtualization, Paravirtualization, and Hardware Assist” is a good reference.

Generally speaking, few users can use virtualization technology directly. On Linux, for example, the users who can manage virtual machine lifecycles are, roughly, those with access to libvirt. In a company or other organization, these users are usually the system administrators.

B. Virtualization Management

Early virtualization solved the problem of providing multiple mutually independent runtime environments on one physical machine. When the number of physical machines to manage is small, a system administrator can log in to each machine by hand to manage virtual machine lifecycles (resource configuration, startup, shutdown, and so on). When the number is large, scripts or programs are needed to raise the degree of automation in lifecycle management. Software whose purpose is managing and scheduling large numbers of physical and virtual computing resources is called a virtualization management tool. Such a tool lets the system administrator, from one place: (1) manage virtual machine lifecycles across different physical machines; (2) query and even monitor all physical and virtual machines; and (3) map virtual machine names to virtual machine instances, making identification and management easier. VirtManager on Linux is a simple virtualization management tool; in the VMware product family, VMware vSphere is a powerful one.

Virtualization management tools are the natural extension of virtualization. Simple ones solve the tedium caused by growing physical machine counts. At this level, virtualization management usually appears together with the concept of a cluster. A management tool typically needs virtual machine lifecycle privileges on each physical machine (for example, a username and password with libvirt access). Within one cluster, for convenience, a management user common to the whole cluster may be set up. Virtualization management, one could say, makes the system administrator’s life easier, but it does not delegate virtual machine lifecycle management to other users.

C. Data Center Virtualization

At the data center level, system administrators face a large variety of hardware and applications. Compared with a small cluster, a data center’s system complexity is far higher, and simple virtualization management tools no longer satisfy administrators’ requirements, so data center virtualization management systems evolved on top of them. At the hardware level, these systems reorganize hardware resources by partitioning them into resource pools (a pool is usually a cluster) and expose computing resources to users as Virtual Infrastructure. At the software level, they introduce two distinct roles, system administrator and ordinary user, or even finer-grained Role Based Access Control (RBAC) as application scenarios demand. The system administrator has management rights over all physical and virtual machines in the data center but generally does not interfere with healthy virtual machines. Ordinary users can perform virtual machine lifecycle operations only within the resource pools they are entitled to, and have no rights to control physical machines. In the extreme case, an ordinary user can see only the resource pools allocated to them, with no knowledge of the physical machines composing those pools.
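The role split just described can be sketched as a tiny RBAC check. This is a minimal illustration, not any product's actual model; the user names, role names, and pool names are all invented.

```python
# Illustrative RBAC sketch for a data center virtualization manager.
# Users, roles, and resource pool names are hypothetical.

GRANTS = {
    "alice": {"role": "admin"},                      # system administrator
    "bob":   {"role": "user", "pools": {"pool-7"}},  # team lead, one pool
}

def can_manage_vm(user, pool):
    """Admins may manage VMs in any pool; ordinary users only in
    the resource pools explicitly granted to them."""
    grant = GRANTS.get(user)
    if grant is None:
        return False
    if grant["role"] == "admin":
        return True
    return pool in grant.get("pools", set())

def can_manage_host(user):
    """Only admins may touch physical machines."""
    grant = GRANTS.get(user)
    return grant is not None and grant["role"] == "admin"

print(can_manage_vm("bob", "pool-7"))  # True: his own pool
print(can_manage_vm("bob", "pool-9"))  # False: someone else's pool
print(can_manage_host("bob"))          # False: no physical-machine access
```

Real systems attach many more verbs (start, stop, snapshot, migrate) and scope them per pool, but the shape of the check, role first, then pool membership, is the same.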

Before data center virtualization, creating a virtual machine required a system administrator. In a data center virtualization management system, role-based access control delegates virtual machine lifecycle management to so-called “ordinary users,” relieving the administrator to some degree. But for system security, not every employee of a company can hold such an “ordinary user” account; generally it goes only to the head of a team. One can say that up to the level of data center virtualization, virtual machine lifecycles are still centrally managed.

Data center virtualization management systems are a further extension of virtualization management tools, solving the system complexity brought by growth in hardware and application scale. Once concrete physical devices are abstracted into resource pools, executives need only know each pool’s size, load, and health, and end users need only know the same for the pools allocated to them. Only the system administrator must still know the configuration, load, and faults of every physical device inside out, but the resource pool concept logically reorganizes and classifies all physical devices, making the administrator’s job easier as well.

Modern data center virtualization management systems often provide many features that help automate operations: (1) rapid template-based deployment of a series of identical or similar runtime environments; (2) monitoring, reporting, alerting, and accounting; and (3) high availability, dynamic load balancing, backup and recovery, and so on. Some of the more open systems even expose open APIs so that system administrators can develop extensions to fit their own scenarios and processes.

In the VMware product family, VMware vCenter is data center virtualization management software. Other recommendable options include ConVirt, XenServer, Oracle VM, and openQRM.

D. Cloud Computing

Cloud computing is a further encapsulation of data center virtualization. Cloud management software likewise needs a cloud administrator and ordinary users (or even more roles) with different privileges. The administrator has management rights over all physical and virtual machines in the data center but generally does not interfere with healthy virtual machines. Ordinary users can manage virtual machine lifecycles self-service through a browser, or write programs that manage them automatically through web services.

At the cloud computing level, virtual machine lifecycle management is fully delegated to genuinely ordinary users, while concepts such as resource pools and physical machines are hidden from their view. Ordinary users can obtain computing resources without any understanding of the physical resources behind them. On the surface, cloud computing seems to be about providing computing resources in a mode compatible with Amazon EC2/S3. In essence, cloud computing is a change in the model of managing computing resources: end users no longer need a system administrator’s help to obtain and manage computing resources self-service.

For the cloud administrator, delegating virtual machine lifecycle management to end users does not reduce the workload; on the contrary, there are more vexing things to handle. In traditional IT architecture, one application usually gets one dedicated set of computing resources, applications are physically isolated, and problem diagnosis is relatively easy. After upgrading to the cloud model, multiple applications may share the same set of computing resources and compete for them, and diagnosis becomes relatively hard. Cloud administrators therefore prefer cloud management software with fairly complete data center virtualization management features. The critical ones are: (1) monitoring, reporting, alerting, and accounting; (2) high availability, dynamic load balancing, backup and recovery; and (3) live migration, useful for local load adjustment and fault diagnosis.

Evidently, from virtualization to cloud computing, physical resources are encapsulated at ever-higher levels while virtual machine lifecycle management is progressively delegated.

In the VMware product family, VMware vCloud is cloud management software. Other recommendable cloud management software includes OpenStack, OpenNebula, Eucalyptus, and CloudStack. Although all four are cloud management software, their features differ substantially, and the differences stem from their different design philosophies. OpenNebula and CloudStack were originally designed as data center virtualization management software and thus have fairly complete features there. After the cloud computing concept took off, OpenNebula added OCCI and EC2 interfaces, and CloudStack provided an additional component called CloudBridge (included by default since CloudStack 4.0), achieving Amazon EC2 compatibility. Eucalyptus and OpenStack were designed top-down as cloud management software with Amazon EC2 as the prototype, considering EC2 compatibility from the start (OpenStack also added its own extensions), but their data center virtualization management features are still somewhat lacking. Of the two, Eucalyptus, having started earlier, is clearly stronger than OpenStack in that respect.

E. Private Clouds and Public Clouds

The cloud computing described in D is only cloud computing in the narrow sense, cloud computing resembling Amazon EC2. In the broad sense, cloud computing can refer to any practice of accessing physical or virtual computers over a network and using their computing resources, including both the cloud computing of D and the data center virtualization of C. What the two share is that the provider supplies computing resources as virtual machines, and the user need not understand the actual physical resources behind them. A cloud platform serving only the inside of one organization can be called a “private cloud”; one serving the public, a “public cloud.” Generally, a private cloud serves an organization’s different departments (or applications) and emphasizes flexibility of virtual resource scheduling (for example, end users can specify a virtual machine’s processor, memory, and disk configuration); a public cloud serves the public and emphasizes standardization of virtual resources (for example, the provider offers only a limited set of virtual machine models with fixed processor, memory, and disk configurations, and end users can only pick the model closest to their needs).

For a public cloud provider, whose business model resembles Amazon EC2’s, cloud management software as described in D is usually the right choice. A private cloud provider should decide based on how the organization manages computing resources internally. If resources are centrally managed, with virtual machine lifecycle privileges delegated only down to department managers or team leads, choose a data center virtualization management system as described in C. If privileges are to be delegated to the end users who actually need the resources, choose cloud management software as described in D.

Traditionally, a private cloud was assumed to be built on an enterprise’s own data center and hardware. But once hardware vendors joined the ranks of cloud service providers, the line between private and public clouds grew increasingly blurred. With Rackspace’s private cloud service, customers may use their own data center and hardware or rent Rackspace’s. Oracle recently went a step further with a private cloud “Owned by Oracle, Managed by Oracle.” In this new business model, the private cloud a customer enjoys exclusively is merely a relatively isolated resource pool inside the provider’s public cloud (you got private cloud in my public cloud); and for the cloud provider, the infrastructure behind the public cloud service may itself be merely a pool inside its own infrastructure (private cloud), or even inside a hardware vendor’s own infrastructure (you got public cloud in my private cloud).

For the customer, a private cloud built on the provider’s data center and hardware is financially sensible. It turns the fixed capital expenditure (CapEx) of building a data center and buying hardware into operating expenses (OpEx) paid in installments, freeing precious cash as working capital for growing the business. Even if the total long-run cost of such a private cloud exceeds building and buying, the returns from deploying the freed cash into the business may exceed the cost difference between the two plans. In the extreme case, even if the company ultimately fails, there is no pile of newly bought hardware to mourn. Short of a rapid upswing in the real estate market, a company on the verge of collapse rarely regrets not having built its own data center. (It should be noted that for a company that operates over a long period, profiting from real estate is entirely possible: before Sun was acquired by Oracle, it once turned its financials from loss to profit by selling off the family estate.)
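The financial argument above can be made concrete with toy numbers. Every figure here is invented for illustration, not drawn from any real company:

```python
# Toy CapEx-vs-OpEx comparison; every number below is invented.
capex = 1_000_000        # build a data center and buy hardware up front
opex_per_year = 300_000  # rent equivalent capacity as an operating expense
years = 5
growth_return = 0.30     # assumed yearly return on cash reinvested in the business

rent_total = opex_per_year * years  # total rent paid over the period

# Renting frees the CapEx as working capital: each year the business
# pays the rent out of its cash and reinvests the remainder.
cash = capex
for _ in range(years):
    cash = (cash - opex_per_year) * (1 + growth_return)

print(f"extra cost of renting over {years} years: {rent_total - capex:,}")  # 500,000
print(f"cash remaining after reinvesting the freed capital: {cash:,.0f}")
```

With these (invented) figures, renting costs 500,000 more over five years, yet the business still ends the period cash-positive, whereas the buy-up-front plan starts with zero working capital. The conclusion flips if the assumed return on reinvested cash is low, which is exactly the trade-off the paragraph describes.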

So what role does the hardware vendor play in this game? When users’ fixed capital expenditure (CapEx) becomes installment operating expense (OpEx), doesn’t the hardware vendor need longer to collect its payment?

In 1865, the English economist William Jevons (1835-1882) wrote a book called The Coal Question. Jevons described a seemingly self-contradictory phenomenon: advances in steam engine efficiency raised coal’s energy conversion rate; the improved conversion rate lowered energy prices; and lower energy prices in turn increased coal consumption. This phenomenon is called the Jevons paradox, and its core idea is that improved resource efficiency lowers prices and ultimately increases resource usage. Over the past 150 years, the Jevons paradox has been confirmed empirically across major industrial materials, transportation, energy, the food industry, and other fields.

The core value of public cloud services is turning servers, storage, networking, and other hardware from self-procured fixed assets into metered public resources. Virtualization raises the utilization of computing resources, which lowers their price, which will ultimately increase their usage. Understand this logic and you understand why HP decisively joined the OpenStack camp and launched an OpenStack-based public cloud service before OpenStack was mature. Granted, doing cloud computing may not rescue a tottering HP, but without cloud computing HP’s days would be numbered. Likewise, you can understand why Oracle went from sneering at cloud computing to presenting itself as a cloud practitioner. After acquiring Sun, Oracle became a world-leading hardware vendor overnight; cloud computing was then just emerging as a concept, and Oracle’s dismissiveness showed it had not yet adjusted to its changed position. Now that cloud computing has moved from concept hype to live exercises, if Oracle, as a major hardware vendor, did not intend to take its share of the cloud, that would be a truly slow reflex.

By the Jevons paradox, for users, a lower price is the precondition for greater usage. How, then, should cloud computing resources be priced?

Currently, most public cloud providers price their virtual machine products by configuration. Take Amazon EC2: its Medium instance (3.75 GB memory, 2 ECU compute units, 410 GB storage, $0.16 per hour) has twice the configuration of the Small instance (1.7 GB memory, 1 ECU compute unit, 160 GB storage, $0.08 per hour), and twice the price. The newly launched HP Cloud Services, and domestically Shanda Cloud and Aliyun, essentially copy Amazon EC2’s pricing method. The problem is that when a virtual machine’s configuration goes up, its performance does not rise proportionally. A series of performance tests against Amazon EC2, HP Cloud Services, Shanda Cloud, and Aliyun shows that for many types of applications, price/performance actually falls as the virtual machine configuration rises. Such a pricing strategy clearly cannot achieve the goal of encouraging users to consume more computing resources.

Pricing by virtual machine performance may be a more appropriate approach. By analogy: a brand of soap comes in small and large packages, one bar versus two. Users willingly pay double for the large package because it washes twice the clothes, not because it looks bigger. Likewise, the different virtual machine products of a single public cloud provider should keep their price/performance as close as possible to one level. The difficulty is that different application types differ greatly in their demands on processor, memory, storage, and other computing resources, and their performance-versus-configuration curves differ too. The public cloud field therefore needs a framework for comprehensively evaluating virtual machine performance, whose results express a virtual machine’s overall processing capability rather than any single one of processor, memory, or storage. With such a benchmark framework, one could compare products not only within one public cloud provider but also across providers.
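Using the EC2 figures quoted above, the falling-value point can be shown with a one-line calculation per instance. The prices are from the text; the `measured_perf` benchmark scores are invented to illustrate the pattern the tests reportedly found (performance less than doubling when the configuration doubles):

```python
# Price/performance sketch. Prices are the EC2 figures quoted above;
# the measured_perf scores are invented to illustrate how config-based
# pricing can hide falling value for money.

instances = {
    "small":  {"price": 0.08, "measured_perf": 100},  # 1 ECU, $0.08/hr
    "medium": {"price": 0.16, "measured_perf": 170},  # 2 ECU, $0.16/hr
}

for name, spec in instances.items():
    perf_per_dollar = spec["measured_perf"] / spec["price"]
    print(f"{name}: {perf_per_dollar:.1f} performance units per dollar-hour")
```

If the medium instance really delivered 200 units, price/performance would be flat; at 170 units the user pays the same rate per unit of configuration but gets less performance per dollar, which is the mismatch performance-based pricing would correct.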

F. Open Source

In recent years we have observed a pattern in the IT field: when a closed-source solution succeeds in the market, one or more open- or closed-source followers offering similar functionality (or services) soon appear. (The reverse case, open-source software appearing first followed by competing closed-source software, is rare.) In operating systems, Linux gradually reached and even surpassed Unix technically and went on to displace Unix’s market position. In virtualization, Xen and KVM followed VMware’s technology closely, broke through in places, and steadily ate into VMware’s market share. In cloud computing, Enomaly first launched a closed-source solution modeled on Amazon EC2, followed closely by open-source solutions represented by Eucalyptus and OpenStack. Meanwhile, the attitudes of traditionally closed-source vendors toward open-source projects and communities have been shifting as well; for example, Microsoft, hostile to open-source projects for years, formed a subsidiary this April named Microsoft Open Technologies, whose goal is to advance Microsoft’s investment in openness, including interoperability, open standards, and open-source software.

Today’s business environment differs substantially from the 1980s, when the Free Software Movement arose. Since Netscape first coined the term Open Source in 1998, open source has become a new model for developing, promoting, and selling software, no longer a mere alternative standing opposed to commercial software. Compared with the traditional closed-source business model, business models based on open source have these traits:

(1) In the seeding stage, keywords like open source or free software attract potential customers and partners. For potential customers, choosing open-source software means getting (part of) a closed-source product’s functionality free or cheap. Partners may be interested in selling enhanced versions of the software (an enterprise edition, say), providing solutions based on it, or in the prospect that the project’s success will boost sales of their own products.

(2) In the growth stage, the main developers come from the founding company and the project’s corporate partners. There are also individual developers contributing purely out of interest, but relatively few. Marketing material for open-source software often carries phrases like “developed by the XYZ community”; for the past decade, the main development force in these “communities” has always come from a very limited number of corporate partners. Some projects, deliberately or not, downplay the partners’ importance in their messaging, even misleading audiences into believing the community consists mainly of individual developers.

(3) In the harvest stage, the project founder and main partners earn financial returns by selling enhanced versions or providing solutions. Other vendors can offer similar products or services, but the project’s main participants usually carry more weight and authority in the market. On the question of monetizing open source, Marten Mickos (CEO of Eucalyptus) said while serving as CEO of MySQL: “To succeed in open source, you must serve (a) people who will spend time to save money, and (b) people who will spend money to save time.” If a company has succeeded at open source, its returns from sales and services should at least exceed its investment in development and promotion. Evidently, some users get to use open-source software free partly because their participation lowers its development and promotion costs, and partly because paying users have paid more for it.

Why, then, are solutions based on open-source software usually cheaper than their closed-source competitors? Typically, the closed-source pioneer of a field faces great challenges at every stage: market research, product design, development and testing, promotion and sales. The open-source follower has the closed-source product as a success case for market research, as a reference template for product design, and benefits from the pioneer’s market development in promotion and sales. In development, the open-source software arrives somewhat later, and the technical progress made in the interval markedly lowers the barrier to entering the field. Beyond that, open-source software may surpass closed-source in certain features, but overall its completeness, usability, stability, and reliability trail slightly behind. Open-source solutions therefore usually market along the lines of “80% of the closed-source product’s functionality at 30% of its price.” The customizability of open-source solutions is an additional attraction for some customers.

In China’s business environment, IT companies (that is, Internet companies) usually spend time to save money, while non-IT companies (traditional industries) usually spend money to save time. Notably, Chinese non-IT companies often do not care whether software is open source, but care greatly about its customizability.

Open source, as a new business model, holds no higher moral ground than the traditional closed model. By the same token, judging different open-source practices at the moral level is also inappropriate. In OpenStack’s seeding stage, Rackspace’s marketing copy called OpenStack “the world’s only truly open source IaaS system.” CloudStack, Eucalyptus, and OpenNebula, open-source projects of similar functionality, were classified by Rackspace as “not truly open source” for retaining partly closed enterprise editions (before April 2012, CloudStack and Eucalyptus each shipped both a fully open community edition and a partly closed enterprise edition; after April 2012, Eucalyptus announced full open-sourcing, and CloudStack, acquired by Citrix and donated to the Apache Foundation, went fully open as well), or for an automated installation package available only to paying customers (OpenNebula Pro is an automated package with enhancements, but all its components are open source). Such messaging lasted nearly two years, until Rackspace released the Rackspace Private Cloud software based on OpenStack, a package similar in nature to OpenNebula Pro. OpenNebula Pro is available only to paying users, while anyone can download and use Rackspace Private Cloud free of charge; the catch is that when the number of managed nodes exceeds 20 servers, you must turn to Rackspace for help (that is, buy the necessary support). Let us set aside whether the code limiting the node count to 20 servers is itself open. A project’s founder and main contributor adding, to its own repackaged distribution, a feature restricting the software’s scope of use is hard to justify at the moral level but perfectly normal at the commercial level. Over the past two years, the measures the OpenStack project has taken in development, promotion, and community have all been textbook cases of business models based on open source.

As mentioned above, one field often hosts several competing open-source projects. Take cloud computing in the broad sense: besides the familiar CloudStack, Eucalyptus, OpenNebula, and OpenStack, there are ConVirt, XenServer, Oracle VM, openQRM, and many more choices. For a specific application scenario, how should one select among so many open-source options? In my personal experience, the selection process can be divided into three stages: requirements analysis, technical analysis, and commercial analysis.

(1) In the requirements analysis stage, dig deeply, for the specific scenario, into the project’s real purpose in adopting cloud computing. In China, many project decision makers’ understanding of cloud computing stops at “raise resource utilization, lower operating costs, provide more convenience,” without realizing that this list is already basic functionality most open-source software can provide. Beyond that, many decision makers default to demanding the full VMware vCenter feature set from open-source software, without considering whether the specific project needs those features. It is therefore essential to investigate the specific scenario, explicitly classify it as data center virtualization or cloud computing in the narrow sense, and dig further into the project’s concrete functional requirements. In many cases either one can meet the customer’s overall needs, and sales’ task is then to steer the customer’s specific needs in a direction favorable to itself, a technique we call expectation management. Requirements analysis, by fixing the scenario’s classification, filters out some of the options.

(2) In the technical analysis stage, first compare each open-source product’s reference architecture, focusing on the difficulties of implementing it, per that architecture, in the specific scenario. Then compare the products at the feature level, treating must-have features and good-to-have features differently. Beyond that, one can evaluate installation and configuration difficulty, the usability of specific features, the completeness of reference documentation, and the potential for secondary development. Technical analysis yields scores and rankings for each product, on which basis the lowest-scoring options can be eliminated.

(3) In the commercial analysis stage, one must establish whether the decision maker is willing to pay for an open-source solution. If not, the project belongs to the “spend time to save money” scenario; otherwise, to “spend money to save time.” Time-spending scenarios depend mainly on the open-source community for technical support, so the project’s community activity is important reference data. Money-spending scenarios depend mainly on a service provider for support, so the provider’s industry standing and local service capability should be examined most closely, and the project’s community activity becomes irrelevant.

In China’s cloud computing market (in the narrow sense), for customers willing to pay, CloudStack and Eucalyptus are the options deserving priority. Both projects started relatively early, offer better stability and reliability, carry considerable industry weight, and have domestic teams providing support and service. Meanwhile, some domestic startup teams have begun offering OpenStack-based solutions, but the necessary field experience is hard to accumulate in a short time, and the experienced Sina SAE team has not yet opened a business of providing external technical support. Some organizations in China do use OpenNebula, but a capability to provide third-party technical service is unlikely to form in the near term. For customers willing to spend time, CloudStack and OpenStack hold the clearer advantage, since both communities are relatively active. Between those two, CloudStack is richer in features and has more enterprise customers and success stories, and may be the better choice in the short term. In the long run, OpenStack-based solutions will grow more and more popular, but the other solutions also keep progressing technically and commercially, so no single winner is likely to dominate within three years. On purely commercial grounds, CloudStack and Eucalyptus may have the higher odds of success.

G. Miscellany

Some friends asked me to add something on the state of cloud computing in China. Frankly, I do not yet have sufficient data, so I will not expand on it here. Liu Liming (Sina Weibo @刘黎明3000) recently published an article worth consulting, titled “A Review of the Cloud Computing IaaS Industry Represented by Aliyun and Shanda Cloud.”

On comparing community activity across different open-source projects, see my recent blog post “CY12-Q3 OpenStack, OpenNebula, Eucalyptus, CloudStack Community Activity Comparison.” In another post, “HP Cloud Services Performance Testing,” I also made an initial proposal for a method of benchmarking public clouds.

All illustrations in this article come from Google search. In addition, some of the conceptual content was adapted from the relevant Wikipedia entries.

[repost ]Startups Are Creating A New System Of The World For IT

original:http://highscalability.com/blog/2012/5/7/startups-are-creating-a-new-system-of-the-world-for-it.html

It remains that, from the same principles, I now demonstrate the frame of the System of the World. – Isaac Newton

The practice of IT reminds me a lot of the practice of science before Isaac Newton. Aristotelianism was dead, but there was nothing to replace it. Then Newton came along and created a scientific revolution with his System of the World. And everything changed. That was New System of the World number one.

New System of the World number two was written about by the incomparable Neal Stephenson in his incredible Baroque Cycle series. It explores the singular creation of a new way of organizing society grounded in new modes of thought in business, religion, politics, and science. Our modern world emerged Enlightened as it could from this roiling cauldron of forces.

In IT we may have had a Leonardo da Vinci or even a Galileo, but we’ve never had our Newton. Maybe we don’t need a towering genius to make everything clear? For years startups, like the frenetically inventive age of the 17th and 18th centuries, have been creating a New System of the World for IT from a mix of ideas that many thought crazy at first, but have turned out to be the founding principles underlying our modern world of IT.

If you haven’t guessed it yet, I’m going to make the case that the New System of the World for IT is that much over hyped word: cloud. I hope to show, using many real examples from real startups, that the cloud is built on a powerful system of ideas and technologies that make it a superior model for delivering IT.

IT has had an explosion of creativity: open source, deep and powerful tool chains, lean and agile development, cloud computing, virtualization, BigData, parallel programming, distributed monitoring, distributed programming, NoSQL, cost driven programming, dynamic languages, real-time processing, asynchronous programming, distributed teams, mobile platforms, viral loops, flat networks, software defined networking, wimpy cores, DevOps, everything as a service, infrastructure as code, and so on and so on. Astounding innovation wherever you look.

We are just now figuring out what new structures and systems are replacing the old, but if you step back a bit, what seems to be happening is that we are creating a new "frame" using a bottom-up methodology that just may be a new System of the World for IT. What is emerging is a new way of working synthesised from all the diverse forces catalogued above. We've created a sort of new physics of development in place of a collection of prescientific alchemical lore.

Since it is startups that are tackling problems that can't be solved using traditional methods, it is through them that we'll explore this new System of the World for IT.

It’s Not All About The Cloud, But It’s Mostly About The Cloud

These days the story of startups primarily revolves around the cloud in one way or another. Not completely, not totally, but usually. That's my inescapable observation based on all the architecture profiles I've written on HighScalability.com. Most involve the cloud.

Not all startups choose the cloud, many do not, but even if a startup doesn’t join a formal cloud, we still see the development of cloud-like infrastructures and the deployment of cloud inspired tool chains. So we’ll just skip all the old arguments about OpEx vs CapEx, IaaS vs PaaS vs SaaS, virtualization vs bare metal, public vs private vs hybrid clouds, and open vs closed clouds. Those are all just business decisions made in the pursuit of business goals.

Which specific choices are made isn’t all that important, which is why I’ll use the term cloud in a generic sense. By cloud I do not mean any particular cloud provider or technology.  Zynga, for example, used Amazon extensively, now they’ve built their own cloud to have more control, use fewer servers, and save money. But what they built is still a cloud.

There is a line of controversy worth pursuing that goes something like this: the cloud is no different than what we have been doing in datacenters for years, so what’s the big deal? The cloud is certainly a systematization and productization of capabilities traditionally found in a well staffed datacenter. So in that way the cloud is nothing new.

The key differentiators between a cloud and a datacenter are often said to be multitenancy, geographical distribution, and elasticity. I want to say the key difference between a cloud and a datacenter is democratization. Where once only a few companies could leverage advanced datacenter services, now everyone, great and small, can exploit the same capabilities. What was once private is now public. What was once specialized is now generic. What was once scarce is now abundant. Programmers jumped on all these new capabilities and turned them into the most sophisticated ecosystem for IT that we've ever seen. That's a big deal.

So it is in cloud inspired features that a New System of the World can be found, not any particular instance of the cloud.

The Old Datacenter Versus The New Cloud

The quickest way I can think of to illustrate what the New System of the World for IT looks like is to consider the innovative work Netflix is doing in replacing their “in-house IT with the cloud for non-trivial applications with hundreds of developers and thousands of systems.”

Netflix is the poster child for moving from the datacenter to the cloud because they've actually done it. Netflix ran their own datacenter and are now 100% cloud. Along the way they've done a lot of original thinking on what it means to run an IT-centric business in the cloud. Adrian Cockcroft, a Cloud Architect at Netflix, has created an amazing Cloud Architecture Tutorial documenting what they've learned.

What follows is a list of some major transitions Netflix has made in going from the datacenter to the cloud. The list is a synthesis of slides in the tutorial. It paints a clear picture of how IT in the cloud is different than IT in the datacenter:

Old Datacenter → New Cloud

  • Licensed and Installed Applications → SaaS (Workday, Pagerduty, EMR)
  • Central SQL Database → Distributed Key/Value NoSQL
  • Sticky In-Memory Session → Shared Memory Cache Session
  • Tangled Service Interfaces → Layered Service Interfaces
  • Instrumented Code → Instrumented Service Patterns
  • Fat Complex Objects → Lightweight Serialized Objects
  • Components as Jar Files → Components as Services
  • Chatty Protocols → Latency Tolerant Protocols
  • Manual and Static Tools → Automated and Scalable Tools
  • SA/Database/Storage/Networking Admins → NoOps/OpsDoneMaturelyButStillOps
  • Monolithic Software Development → Teams Organized around Services
  • Monolithic Applications → Building Your Own PaaS
  • Static and Slow Growing Capacity → Incremental and Fast Growing Capacity
  • Heavy Process/Meetings/Tickets/Waiting → Better Business Agility
  • Single Location → Massive Geographical Distribution
  • Vendor Supply Chains → Direct to Developer
  • Focus on How Much it Costs → Focus on How Much Value it Brings
  • Ownership/CapEx → Leasing/OpEx/Spot/Reserved/On Demand

Some principles we see at work are a move to distributed architectures, a focus on generating business value through agility and flexibility, a move away from ownership as a core competency, a separation of concerns along services boundaries, a decentralization and reorganization of processes around services, and a push of responsibility to as close to the developer as possible.

We’ll explore some of these ideas in later sections, but I think this makes it clear we aren’t just talking business as usual, when taken altogether we are talking about something new. It’s a complete transformation at every level.

If you want to say we can do all this in the datacenter I can’t argue, because clouds are built on datacenters. Though I would argue, that once a datacenter can do all these things, it has become a cloud.

The IT World Is Now Flat

Although the New System of the World was pioneered by startups, what has developed, strangely enough, serves to make any enterprise development group just as agile as any startup. The IT world has become flat. There’s now a level playing field across all of IT. The cloud has changed the core economic concepts of delivering business value on top of IT.

A small team in any company can recognize an opportunity, create a product within a week, have it run in many different locations worldwide, with almost no startup capital, and with a low sysadmin burden. Idea to innovation in the time it would have previously taken to work up a hardware request budget proposal.

For some time we’ve had practices like: agile development, extreme automation, short development iterations, continuous integration, continuous deployment, continuous testing, small dedicated teams, and so on. These practices, although much talked about, were seldom implemented.  What slowed adoption was a missing element: the cloud’s programmable IT fabric.

Previously a complex and highly specialized stack was required to follow the agile path. Now it’s easy for any group to develop software this way. And we’ve seen startup after startup adopt these strategies, creating a total revolution in practice on everything about how software is created, distributed, and maintained.

One reason for this revolution is explained by Etsy in terms of Conway’s Law:

When a team makes a product the product ends up resembling the team that made it.

I’ll extend this notion to say the team and thus the product end up resembling the underlying technology used to make it. When you change the underlying development infrastructure, by moving to a cloud, you are bound to change teams and processes they create.

Here are a few examples from startups of how pretty much everything has changed:

  • Instagram: Give me a place to stand and with a lever I will move the whole world. An organization with 2 backend engineers can now scale a system to 30+ million users and be bought for one billion dollars. Regardless of your opinion on the purchase price, the ability for a small organization to handle such a huge user base is an unprecedented amount of leverage.
  • Fidelity. Fidelity is not a startup, but they are creating a next generation internal cloud, saying that the cloud and BigData are creating new rules for IT organizations to innovate. No longer will they be hampered by the organization.
  • Netflix: There's virtually no process at Netflix. They don't believe in it. They don't like to enforce anything. It slows progress and stunts innovation. They want high velocity development. Each team can do what they want and release whenever they want, how often they want. Teams release software all the time, independent of each other. They call this an "optimistic" approach to development.
  • Netflix: NoOps. “We have hundreds of developers using NoOps to get their code and datastores deployed in our PaaS and to get notified directly when something goes wrong. We have built tooling that removes many of the operations tasks completely from the developer, and which makes the remaining tasks quick and self service. There is no ops organization involved in running our cloud, no need for the developers to interact with ops people to get things done, and less time spent actually doing ops tasks than developers would spend explaining what needed to be done to someone else.”
  • Etsy: Continuous deployment. Any engineer at Etsy can deploy the whole site to production at any time. It happens 25 times a day because it's so easy. It's a one button deploy. Small change sets are going out all the time, not large deployments. If things go wrong they can quickly figure out what went wrong and fix it. Compare this to the infrequent big bang software updates that are typical.
  • Etsy: QA is performed by developers. Development makes production changes themselves. This has the effect of bringing them closer to production, which enables an operability mindset, as opposed to a ship-to-QA-and-consider-it-done mindset. Developers deploying their own code also brings accountability, responsibility, and the requisite authority to influence production. No Operations engineer stands between a Development engineer and deploying.
  • Facebook: Small, independent teams with both responsibility and control. Small teams allow work to be done efficiently, quickly, and carefully. Only three people work on photos, for example, the largest photo site on the Internet. But responsibility requires control. If a team is responsible for something they must control it. For example, Facebook pushes code into production every day. The person who wrote the code is there to fix anything that goes wrong. If the responsibilities of pushing and writing code are split, then the code writer doesn't feel the effect of code that breaks the system. Compare this to the typical separation of developers, QA, and DevOps.
  • Facebook: Move Fast. At every level of scale there are surprises. Surprises are quickly dealt with by a highly qualified cross-disciplinary team that is flexible and skilled enough to deal with anything that comes up. Flexibility is more important than any individual technical decision. By moving fast Facebook is also able to try more options and figure out which ones work best. Compare this to typical heavyweight planning and development processes.
  • TripAdvisor: No architects; engineers work across the entire stack. You own your project end to end, and are responsible for design, coding, testing, and monitoring. Most projects are 1-2 engineers. If you do not know something, you learn it. The only thing that gets in the way of delivering your project is you, as you are expected to work at all levels. Compare this to the islands of specialization that are typical in IT.
  • Amazon: You build it, you run it. Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, throw it over, and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

This is not how software development has been done in the past. What makes it possible is the leverage gained by an IT programming fabric that treats a datacenter and its contained services as being software scriptable. From this base very powerful tool chains like Rightscale, Chef, Puppet, and dozens of others have been developed to make it possible for small teams to quickly do a lot with a little.
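The programmable-fabric idea is easiest to see in miniature. Below is a toy sketch (not any real tool's API) of the declare-and-converge pattern that tools like Chef, Puppet, and RightScale implement at full scale: desired infrastructure is expressed as data, and a converge step computes the launch/terminate actions needed to reach it.

```python
# Toy illustration of "infrastructure as code": declarative desired state,
# converged against observed state. All role names and counts are invented.

def converge(desired, running):
    """Compare desired state with running instances and emit actions."""
    actions = []
    for role, count in desired.items():
        have = running.get(role, 0)
        if have < count:
            actions.append(("launch", role, count - have))
        elif have > count:
            actions.append(("terminate", role, have - count))
    return actions

desired = {"web": 4, "worker": 2, "cache": 1}
running = {"web": 2, "worker": 3}          # current cloud state
print(converge(desired, running))
# [('launch', 'web', 2), ('terminate', 'worker', 1), ('launch', 'cache', 1)]
```

Because the description is data, the same converge step can be re-run continuously, which is what makes small teams able to manage large fleets.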

Reducing The Mean Time Between Big Ideas

The most startling change post cloud is the increasing pace of innovation. Netflix sees the cloud as a laboratory for reducing the mean time between big ideas. James Urquhart says the cloud is a lottery system for developers. Developers can implement something quickly and cheaply, hope it gets 10 million users, hope it succeeds big, and if it doesn't, it wasn't that expensive to fail.

The software startup landscape itself has been changed forever. Previously you would go to a VC for the many millions needed to even begin an idea, now you are expected to have a prototype ready before even seeing a VC.

Here are a few examples of how startups are making use of these new capabilities to innovate:

  • Robert Scoble talks about how the flood of new startups is just starting. Startups are starting all over the world, not just Silicon Valley or New York. Now you can start a startup in the middle of nowhere in India. The costs of starting a startup have gone way down. Y Combinator used to have just 10 companies come out of a class; now they have 60. And each month there are more and more incubators. Two kids can start Instagram and they can start it anywhere.
  • TripAdvisor: Engineering can be best compared to running two dozen simultaneous startups, all working on the same code base and running in a common distributed computing environment. Each of these teams has their own business objectives, and each team is able to, and responsible for, all aspects of their business. Each of the teams operates in the way that best fits their distinct business and personal needs, this process is best described as “post agile/scrum”.
  • Netflix: Runs in Amazon so they can innovate and not have to worry about growth in the future.
  • Netflix: “We built a completely cloud based infrastructure in the US and did some work extracting it so we could actually deploy it anywhere. We set up a bunch of test machines in the AWS Ireland facility and we built the ability to replicate data across both sites. In total we set up 1,000 machines in Ireland. If we had built our own data centre then we would have had to lay down a large amount of money in, say, six months in advance for a really efficient build out, and instead we could use that money to buy movies.”
  • Playfish: The cloud allows Playfish to innovate and try new features and new game with very low friction, which is key in a fast moving market. The cloud allows them to concentrate on what makes them special, not building and managing servers.
  • Zynga: Zynga uses the cloud to deploy their applications and prove them out while handling the load during the process. They then fold applications back into their datacenter once the growth trajectory has been established. It’s not about saving money, it’s about growing business.
  • Steve Lacy: Amazon’s EC2 is a better ecosystem for fast iteration and innovation than Google’s internal cluster management system.  EC2 gives me reliability, and an easy way to start and stop entire services, not just individual jobs.

Typically a datacenter is a lock, a point of serialization for developers that creates a vertical barrier through the entire stack. By unshackling developers from IT infrastructure people it opens up the possibility space and developers can do new things they could never do before.

Of course, the distributed infrastructure of the Internet is essential to the low-friction creation and distribution of ideas, and to building teams and sharing code. And the web and mobile are far more fertile niches for startups than any enterprise landscape. Yet the cloud, by creating an elastic usage model for all the services developers consume, has unshackled developers. Developers can now be sure everything will just work without first having to ask permission. The entire cycle is now developer driven, which has thrown an accelerant on the fire of innovation.

It’s Open Source All The Way Down

The foundations for this New System of the World sit squarely on Open Source software. There is virtually no startup you can name that is not built primarily on Open Source. Take a look at Tumblr's stack as a quick example: Linux, Apache, PHP, Scala, Ruby, Redis, HBase, MySQL, Varnish, HAProxy, nginx, Memcache, Gearman, Kafka, Kestrel, Finagle, Thrift, HTTP, Func, Git, Capistrano, Puppet, and Jenkins.

It’s all open source and Tumblr is by no means unique, this is a common pattern.

Open Source started with small libraries and has moved up stack with ever larger and more sophisticated components, applications, tools, languages, and operating systems. Now we are seeing movement into Open Source hardware, networking, and even Open Source clouds. At one time this was not true. At one time most software was developed with closed source tool chains. That has completely changed.

While Open Source was firmly established in the programming tools arena, LiveJournal was probably the earliest example of creating and open sourcing more sophisticated infrastructure tools like memcached and MogileFS. Possibly even more important, they took the time to talk about the architecture challenges they faced and how they solved them. LiveJournal was the prototype for the early web.

This attitude helped create a virtuous circle in the development community, spawning a tradition that has continuously become more generous and more productive over time. Major companies like Netflix, Twitter, LinkedIn, Google, and Facebook are not only first to tackle scaling challenges, but they Open Source many of the solutions. And more importantly, they share their experiences and lessons learned with the whole community.

The impact of Open Source on productivity and innovation has been transformative. The advantage Open Source gives you is time. You can do more in less time. If you want to plug into this productivity cycle then you need to align yourself with the Open Source ecosystem. It’s not just for startups, it’s for anyone developing products. Use closed source where it offers a competitive advantage, but the fastest innovation is happening in the Open Source community and that’s with whom you want to make alliances.

It’s Loosely Coupled Services All The Way Down

If Open Source is the foundation for the New System of the World then Service Oriented Architectures are the load bearing walls. As we’ll see, services are not just a software architecture feature anymore, but they’ve become the organizing principle around how teams and software are constructed.

Services have been around forever. Client-server programming was invented as a way for applications to take advantage of networks of computers. This idea was lost on early web architectures that stuffed everything into two or three tier architectures. A browser talked to a web server that invoked code that would return a web page. That code might talk to a database, but it was always a monolithic self-contained blob. As web sites needed to scale, programmers rediscovered client-server programming and started breaking down monolithic applications into cooperating collections of services. Services started talking to other services and soon web servers weren’t application servers anymore, but just a thin layer around a set of service calls. The dependence of rich UIs and mobile applications on backend services has simply continued this evolution.
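To make the "thin layer around a set of service calls" concrete, here is a minimal sketch. The registry, service names, and data are all invented for illustration; in a real system each service would be a separate networked process reached by RPC or HTTP rather than an in-process function.

```python
# Sketch of a web tier reduced to service composition. Services are plain
# functions behind a registry; service names are hypothetical.

SERVICES = {}

def service(name):
    """Register a function as a named service endpoint."""
    def register(fn):
        SERVICES[name] = fn
        return fn
    return register

def call(name, *args):
    return SERVICES[name](*args)   # in production: a network call

@service("users.get")
def get_user(user_id):
    return {"id": user_id, "name": "alice"}

@service("posts.recent")
def recent_posts(user_id):
    return [{"title": "hello"}, {"title": "scaling"}]

def render_home(user_id):
    """Thin aggregation layer: no business logic, just service composition."""
    return {"user": call("users.get", user_id),
            "posts": call("posts.recent", user_id)}

print(render_home(42))
```

The point is structural: the page-rendering layer owns no data and no logic of its own, so each service behind the registry can be scaled, deployed, and owned by a separate team.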

Here's how a number of startups are using Service Oriented Architectures:

  • Wordnik: “We’ve made a significant architectural shift. We have split our application stack into something called Micro Services. The idea is that you can scale your software, deployment and team better by having smaller, more focused units of software. The idea is simple — take the library (jar) analogy and push it to the nth degree. If you consider your “distributable” software artifact to be a server, you can better manage the reliability, testability, deployability of it, as well as produce an environment where the performance of any one portion of the stack can be understood and isolated from the rest of the system. Now the question of “whose pager should ring” when there’s an outage is easily answered! The owner of the service, of course.”
  • Playfish: Service Oriented Architectures are used at Playfish to manage complexity. As new games are added code is split into different components that are managed by different teams. This helps keep the overall complexity of the system down, which helps make everything easier to scale.
  • Amazon:
    • The big architectural change that Amazon made was to move from a two-tier monolith to a fully-distributed, decentralized, services platform serving many different applications. Their architecture is loosely coupled and built around services. A service-oriented architecture gave them the isolation that would allow building many software components rapidly and independently. Grew into hundreds of services and a number of application servers that aggregate the information from the services.
    • Services are the independent units delivering functionality within Amazon. It's also how Amazon is organized internally in terms of teams. If you have a new business idea or problem you want to solve, you form a team. Teams are limited to 8-10 people because communication is hard. They are called two pizza teams: the number of people you can feed with two pizzas. Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit.
  • Netflix: “If you think about infrastructure as a service and platform as a service (PaaS), what we’ve built is a PaaS over the top of the AWS infrastructure, which is as thin a layer as we could build, leveraging as many Amazon features as seemed interesting and useful. Then we put a thin layer over that to isolate our developers from it.”
  • Netflix: Their architecture is service based. Many small teams of 3-5 person teams are completely responsible for their service: development, support, deployment. They are on the pager if things go wrong so they have every incentive to get it right. They’ve built a decoupled system where every service is capable of withstanding the failure of every service it depends on. Everyone is sitting in the middle of a bunch of supplier and consumer relationships and every team is responsible for knowing what those relationships are and managing them. It’s completely devolved — they don’t have any centralised control. They can’t provide an architecture diagram, it has too many boxes and arrows. There are literally hundreds of services running.
  • Facebook: Each layer is connected via well defined interface that is the sole entry point for accessing that service. This prevents nasty complicated interdependencies. Clients hide behind an application API. Applications use a data access layer. Application logic is encapsulated in application servers that provide an API endpoint. Application logic is implemented in terms of other services. The application server tier also hides a write-through cache as this is the only place user data is written or retrieved, it is the perfect spot for a cache.
  • Tumblr: Built a kind of Rails scaffolding, but for services. A template is used to bootstrap services internally. All services look identical from an operations perspective. Checking statistics, monitoring, starting and stopping all work the same way for all services.
  • Justin.tv: “The shift first started with the ascendancy of native mobile apps. Now, developers had to seriously start considering their HTTP APIs as first-class citizens and not nice-to-haves. Once that happened, it’s not a big leap to realize that treating your web application as somehow different from any of your native clients is a bit, well, insane.”

Now everything is kind of like it was before: service based, message passing based, distributed, real-time, queue based, and completely asynchronous. The tools to accomplish all this are different of course, but in principle they are similar.

What’s radically different from the past is the unification of services by rearchitecting entire products as a PaaS. This is made possible by a suite of scalable services linked together using a distributed IT fabric. Architectures can now be elastic and adaptive in ways that are still being explored.

Lifecycle Of A Project: Public Cloud To Private Cloud — Or Vice Versa — Or Both

New in this New System of the World is the idea of federated compute spaces that application functionality can flow between depending on business objectives.

Zynga is the most famous practitioner of this form of cloud thermodynamics. Zynga used the public Amazon cloud to deploy their applications, prove them out, and handle load during the initial phases of the release process. Then, once the growth trajectory had been established, they folded the application back into their own datacenter.

It wasn't an architecture decision based on saving money; it was about growing the business. Zynga has matured and is now moving off Amazon, into their own private cloud, in search of lower costs and better performance, but they've created an enduring architectural pattern that will work for anyone.

The ability for a business to target business goals with this degree of risk management flexibility was virtually impossible in the rack’em and stack’em age.

Cost Driven Architectures

In the New System of the World how applications are architected has changed forever with the introduction of pay for use models like SaaS, PaaS, and IaaS.

Historically in programming the costs we talk about are time, space, latency, bandwidth, storage, person hours, etc. Infrastructure costs have been part of the capital budget. Someone ponies up for the hardware and software is then “free” until more infrastructure is needed. The dollar cost of software design isn’t usually an explicit factor considered.

Now software design decisions are part of the operations budget. Every algorithm decision you make will have a dollar cost associated with it, and it may become more important to craft algorithms that minimize operations cost across a large number of resources (CPU, disk, bandwidth, etc.) than it is to trade off our old friends space and time.

Different resource costs will force very different design decisions. On Amazon do you use a spot instance, a reserved instance, or an on demand instance? Do you need a small or extra large or one of another dozen instance choices? Do you need to span multiple regions, or is working across multiple availability zones acceptable? Should you build your own or use a built-in SaaS? Should you risk lock-in and use more of the built-in services, or try to stay as independent as possible?
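The kind of arithmetic these questions force can be sketched in a few lines. Every rate, fee, and utilization figure below is a made-up placeholder, not a real cloud price; the point is only that pricing model now shapes design.

```python
# Back-of-the-envelope cost comparison across hypothetical pricing models.

HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate, utilization=1.0, fixed=0.0):
    """fixed = upfront fee amortized per month (e.g. a reserved commitment)."""
    return fixed + hourly_rate * HOURS_PER_MONTH * utilization

options = {
    "on_demand": monthly_cost(0.10, utilization=0.30),          # pay only when used
    "reserved":  monthly_cost(0.04, utilization=1.0, fixed=25), # always-on commitment
    "spot":      monthly_cost(0.03, utilization=0.30),          # cheap but preemptible
}
best = min(options, key=options.get)
print(best, round(options[best], 2))
```

At low utilization the spot/on-demand models win; push utilization toward 100% and the reserved commitment starts to pay for itself, which is exactly the kind of trade-off that used to live in a capital budget and now lives in the code.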

Just a few short years ago these were all issues you would never have considered. A phase change has happened in architecture. Even if you aren't in a public cloud, it's likely you'll conceptualize your architecture in this way, because that's how the infrastructure tools will be patterned.

Flow Architectures – The Firehose

One of the consequences of using a Service Oriented Architecture is a lot of messages need to be targeted to a lot of different endpoints. And because in the cloud you aren’t standing up a few servers and nailing down connections between them anymore, you need a robust message bus to connect everything together.

The solution that has evolved is the Firehose. A firehose is a message bus that can handle elastic components, message queueing, fault isolation, asynchronous processing, low latency communication, and operating at a high scale.
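The semantics described above and in the Tumblr example below (an append-only stream that can be rewound in time, with per-consumer-group offsets so group members never see duplicates) can be sketched with a toy in-memory log. This is Kafka-like in spirit only; everything here is illustrative.

```python
# Minimal in-memory firehose: an append-only log plus committed offsets
# per consumer group.

class Firehose:
    def __init__(self):
        self.log = []                  # append-only message log
        self.offsets = {}              # consumer group -> next offset

    def publish(self, msg):
        self.log.append(msg)

    def read(self, group, start=None):
        """Read messages since the group's offset (or an explicit rewind point)."""
        pos = start if start is not None else self.offsets.get(group, 0)
        msgs = self.log[pos:]
        self.offsets[group] = len(self.log)   # commit: group won't see these again
        return msgs

hose = Firehose()
for event in ["post_created", "post_liked", "post_deleted"]:
    hose.publish(event)

print(hose.read("indexer"))             # all three events
print(hose.read("indexer"))             # [] -- no duplicates for the same group
print(hose.read("analytics", start=1))  # rewound to offset 1
```

Because the log is retained rather than discarded on delivery, a new service can attach later and replay history, which is what lets architectures reorganize around information flows.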

Here are a few examples of startups using firehose architectures:

  • Tumblr:  Internally applications need access to the activity stream of information about users creating/deleting posts, liking/unliking posts, etc.  A challenge is to distribute so much data in real-time. An internal firehose was created as a message bus. Services and applications talk to the firehose via Thrift. LinkedIn’s Kafka is used to store messages. Internally consumers use an HTTP stream to read from the firehose. The firehose model is very flexible, not like Twitter’s firehose in which data is assumed to be lost. The firehose stream can be rewound in time and it retains a week of data. On connection it’s possible to specify the point in time to start reading. Multiple clients can connect and each client won’t see duplicate data. Each consumer in a consumer group gets its own messages and won’t see duplicates.
  • DataSift: Created an Internet scale filtering system that can quickly evaluate very large filters. It is essentially a giant firehose. 0mq is used for replication, message broadcasting, and round-robin workload distribution. Kafka (LinkedIn's persistent and distributed message queue) is used for high-performance persistent queues.

An interesting architecture evolution we are seeing in the cloud is how systems continually reorganize themselves to give components better access to information flows. This allows services to be isolated yet still have access to all the information they need to carry out their specialized function. Before firehose style architectures the easiest path was to create monolithic applications because information was accessible only in one place. Now that information can flow freely and reliably between services, much more sophisticated architectures are possible.

Cell Architectures

Another consequence of Service Oriented Architectures is providing services at scale. The architecture that has evolved to satisfy these requirements is a little known technique called the Cell Architecture.

A Cell Architecture is based on the idea that massive scale requires parallelization and parallelization requires components be isolated from each other. These islands of isolation are called cells. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, typically a range of users, for example.

Cell Architectures have several advantages:

  • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
  • Cells are added in an incremental fashion as more capacity is required.
  • Cells isolate failures. One cell failure does not impact other cells.
  • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
  • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.
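The core mechanism, homing each user to a cell so failures stay contained, can be sketched as follows. This is a simplified illustration, not any particular company's router; the class name and hash-based mapping are assumptions (production systems typically use an explicit directory service so user-to-cell homes stay sticky as cells are added):

```python
import hashlib

class CellRouter:
    """Routes each user to one of N self-contained cells.

    Each cell is an island of isolation that serves a shard of
    users; a failed cell affects only the users homed to it.
    """
    def __init__(self, num_cells):
        self.num_cells = num_cells
        self.down = set()   # cells currently failed or in maintenance

    def cell_for(self, user_id):
        # Deterministic hash so the same user always lands on the
        # same cell (a real directory service would allow remapping).
        h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return h % self.num_cells

    def handle(self, user_id, request):
        cell = self.cell_for(user_id)
        if cell in self.down:
            # Failure is isolated: only this cell's users are affected.
            raise RuntimeError(f"cell {cell} unavailable")
        return f"cell-{cell} served {request} for user {user_id}"
```

Marking one cell as down leaves every other cell's users unaffected, which is the isolation property the list above describes.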

A number of startups make use of Cell Architectures:

  • Tumblr: Users are mapped into cells and many cells exist per data center. Each cell has an HBase cluster, service cluster, and Redis caching cluster. Users are homed to a cell and all cells consume all posts via firehose updates. Background tasks consume from the firehose to populate tables and process requests. Each cell stores a single copy of all posts.
  • Flickr: Uses a federated approach where all a user’s data is stored on a shard which is a cluster of different services.
  • Facebook: The Messages service has as the basic building block of their system a cluster of machines and services called a cell. A cell consists of ZooKeeper controllers, an application server cluster, and a metadata store.
  • Salesforce: Salesforce is architected in terms of pods. Pods are self-contained sets of functionality consisting of 50 nodes, Oracle RAC servers, and Java application servers. Each pod supports many thousands of customers. If a pod fails only the users on that pod are impacted.

While the internal structure of a cell can be quite complex, the programmability of the cloud makes it relatively easy to configure, start, stop, failover and respond elastically to load.

Conclusion

We are still figuring out the New System of the World for IT. What was strange just a few years ago is now commonplace. Many discoveries and innovations wait to be made; the work will never be complete, but the path has been set.

[repost ]The Conspecific Hybrid Cloud

original:http://highscalability.com/blog/2012/3/21/the-conspecific-hybrid-cloud.html

When you’re looking to add new tank mates to an existing aquarium ecosystem, one of the concerns you must have is whether a particular breed of fish is amenable to conspecific cohabitants. Many species are not, which means if you put them together in a confined space, they’re going to fight. Viciously. To the death. Responsible aquarists try to avoid such situations, so careful attention to the conspecificity of animals is a must.

Now, while in many respects the data center ecosystem correlates well to an aquarium ecosystem, in this case it does not. It’s what you usually get, today, but it’s not actually the best model. That’s because what you want in the data center ecosystem – particularly when it extends to include public cloud computing resources – is conspecificity in infrastructure.

This desire and practice is being seen both in enterprise data center decision making as well as in startups suddenly dealing with massive growth and increasingly encountering performance bottlenecks over which IT has no control to resolve.

OPERATIONAL CONSISTENCY

One of the biggest negatives to a hybrid architectural approach to cloud computing is the lack of operational consistency. While enterprise systems may be unified and managed via a common platform, resources and delivery services in the cloud are managed using very different systems and interfaces. This poses a challenge for all of IT, but is particularly an impediment to those responsible for devops – for integrating and automating provisioning of the application delivery services required to support applications. It requires diverse sets of skills – often those peculiar to developers, such as programming and standards knowledge (SOAP, XML) – as well as those traditionally found in the data center.

“We own the base, rent the spike. We want a hybrid operation. We love knowing that shock absorber is there.” – Allan Leinwand, Zynga’s Infrastructure CTO

Other bottlenecks were found in the networks to storage systems, Internet traffic moving through Web servers, firewalls’ ability to process the streams of traffic, and load balancers’ ability to keep up with constantly shifting demand.

Zynga uses Citrix Systems CloudStack as its virtual machine management interface superimposed on all zCloud VMs, regardless of whether they’re in the public cloud or private cloud.

Inside Zynga’s Big Move To Private Cloud by InformationWeek’s Charles Babcock

This operational inconsistency also poses a challenge in the codification of policies across the security, performance, and availability spectrum as diverse systems often require very different methods of encapsulating policies. Amazon security groups are not easily codified in enterprise-class systems, and vice-versa. Similarly, the options available to distribute load across instances required to achieve availability and performance goals are impeded by lack of consistent support for algorithms across load balancing services as well as differences in visibility and health monitoring that prevent a cohesive set of operational policies to govern the overall architecture.

Thus if hybrid cloud is to become the architectural model of choice, it becomes necessary to unify operations across all environments – whether public or enterprise.

UNIFIED OPERATIONS

We are seeing this demand more and more, as enterprise organizations seek out ways to integrate cloud-based resources into existing architectures to support a variety of business needs – disaster recovery, business continuity, and spikes in application demand. What customers are demanding is a unified approach to integrating those resources, which means infrastructure providers must be able to offer solutions that can be deployed both in a traditional enterprise-class model as well as a public cloud environment.

This is also true for organizations that may have started in the cloud but are now moving to a hybrid model in order to seize control of the infrastructure as a means to address performance bottlenecks that simply cannot be addressed by cloud providers due to the innate nature of a shared model.

This ability to invoke and coordinate both private and public clouds is “the hidden jewel” of Zynga’s success, says Allan Leinwand, CTO of infrastructure engineering at the company.

– Lessons From FarmVille: How Zynga Uses The Cloud

While much is made of Zynga’s “reverse cloud-bursting” business model, what seems to be grossly overlooked is the conspecificity of infrastructure required in order to move seamlessly between the two worlds. Whether at the virtualization layer or at the delivery infrastructure layer, a consistent model of operations is a must to transparently take advantage of the business benefits inherent in a cross-environment, aka hybrid, cloud model of deployment.

As organizations converge on a hybrid model, they will continue to recognize the need and advantages of an operationally consistent model – and they are demanding it be supported. Whether it’s Zynga imposing CloudStack on its own infrastructure to maintain compatibility and consistency with its public cloud deployments or enterprise IT requiring public cloud deployable equivalents for traditional enterprise-class solutions, the message is clear: operational consistency is a must when it comes to infrastructure.

[repost ]Strategy: Cache Application Start State To Reduce Spin-Up Times

original:http://highscalability.com/blog/2011/4/14/strategy-cache-application-start-state-to-reduce-spin-up-tim.html

Using this strategy, Valyala, a commenter on Are Long VM Instance Spin-Up Times In The Cloud Costing You Money?, was able to reduce their GAE application start-up times from 15 seconds down to 1.5 seconds:

Spin-up time for newly added Google AppEngine instances can be reduced using initial-state caching. Usually the majority of spin-up time for a newly created GAE instance is spent pre-populating the initial state, which is assembled from many data pieces loaded from slow data sources such as GAE’s datastore. If the initial state is identical among GAE instances, the first created instance can serialize the entire state and store it in shared memory (either in the memcache or in the datastore). Newly created instances can then load and quickly deserialize the state from a single blob read from shared memory, instead of spending a lot of time building the state from multiple data pieces loaded from the datastore.

I reduced spin-up time for new instances of my GAE application from 15 seconds to 1.5 seconds using this technique.
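A minimal sketch of the technique described above might look like the following. The function names and the dict standing in for memcache are assumptions for illustration; a real GAE app would use the memcache API and its own state-building code:

```python
import pickle

SHARED_CACHE = {}   # stand-in for memcache or a datastore blob

def load_piece(name):
    """Simulate one slow datastore read contributing to the state."""
    return {name: name.upper()}

def build_initial_state(pieces):
    """Slow path: assemble state from many slow data sources."""
    state = {}
    for p in pieces:
        state.update(load_piece(p))
    return state

def get_initial_state(pieces, cache_key="initial_state"):
    """Fast path: if a peer instance already built and serialized the
    state, deserialize it from a single cached blob; otherwise build
    it the slow way and cache it for the instances that follow."""
    blob = SHARED_CACHE.get(cache_key)
    if blob is not None:
        return pickle.loads(blob)
    state = build_initial_state(pieces)
    SHARED_CACHE[cache_key] = pickle.dumps(state)
    return state
```

The first instance pays the full assembly cost; every subsequent instance does one cache read and one deserialize, which is where the 15-second-to-1.5-second improvement comes from.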

Theoretically the same approach could be used for VM-powered clouds such as Amazon EC2, if the cloud were able to fork() new VMs from a given initial state. Application developers could then boot and pre-configure the required services in a ‘golden’ VM, which would be stored as a snapshot somewhere in shared memory. The snapshot would be used for fast fork()’ing of new VMs. A VM fork() can be much faster compared to the cold boot of a new VM with the required services.

As another commenter noted, GAE now has an Always On feature, which keeps three instances of your app running, but the rub here is you have to pay for the resources you are using. This approach minimizes costs and works across different types of infrastructures.

I’ve successfully used similar approaches for automatically starting, configuring, and initializing in-memory objects across a cluster. In this architecture:

  • Each object has an ID that is mapped to a bag of attributes. Some of those attributes are configuration attributes, some are events, alarms, and dynamic attributes for holding current state.
  • On each node a software system is in charge of figuring out which objects are assigned to which nodes, creating all those objects, and running each object through a startup state machine which includes the object retrieving its state from the database and performing any other required initialization.
  • When all objects have moved to a ready state, the node itself is considered ready for service. The node status is sent to all other nodes, which then know they can use that node for service.

This works great. It minimizes the burden on the application programmer, makes node bring-up fast and easy, and feeds directly into an automatic replication and fail-over system.
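The three bullets above can be sketched as a small startup state machine. This is a simplified illustration of the pattern, not the original system; the class names and the dict standing in for the database are assumptions:

```python
from enum import Enum

class State(Enum):
    CREATED = 1
    LOADING = 2
    READY = 3

class ManagedObject:
    """One in-memory object walked through a startup state machine:
    CREATED -> LOADING (retrieve state from the store) -> READY."""
    def __init__(self, object_id, store):
        self.object_id = object_id
        self.store = store          # stand-in for the database
        self.state = State.CREATED
        self.attributes = {}        # config, events, dynamic state

    def start(self):
        self.state = State.LOADING
        self.attributes = self.store.get(self.object_id, {})
        self.state = State.READY

class Node:
    """A node is in service only once every assigned object is READY;
    only then is its status announced to the other nodes."""
    def __init__(self, assigned_ids, store):
        self.objects = [ManagedObject(i, store) for i in assigned_ids]

    def bring_up(self):
        for obj in self.objects:
            obj.start()
        return self.ready()

    def ready(self):
        return all(o.state is State.READY for o in self.objects)
```

Keeping the state machine in the platform layer is what lifts the burden off the application programmer: each object only implements its own retrieval and initialization, and the node-level readiness and announcement logic comes for free.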

Related Articles