DevOps as I See It
Over the last year or so a bunch of presumptuous European sysadmins and developers, joined by some of their American brethren and even a couple of us antipodeans (there are others too!), have been talking about a concept called DevOps. DevOps is the merger of the realms of development and operations (and, if truth be told, elements of product management, QA, and *winces* even sales should be thrown into the mix too).
The Broken
So … why should we merge or bring together the two realms? Well there are lots of reasons but first and foremost because what we’re doing now is broken. Really, really broken. In many shops the relationship between development (or engineering) and operations is dysfunctional to the point of occasional toxicity.
Here’s an example I think everyone will be at least partially familiar with: the minefield that is project to production software deployment. Curse along as I explain.
Development builds an application, the new hotness which promises customers all the whizz-bang features and will make the company millions. It is built using cutting edge technology and a brand new platform and it has got to be delivered right now. Development cuts code like crazy and gets the product ready for market ahead of schedule. They throw their masterpiece over the fence to Operations to implement and dash off to the pub for the wrap party.
Operations catches the deployment and is filled with horror.
The Operations team summarises their horror and says one or more of:
* The wonder application won’t run on our infrastructure because {it’s too old, it doesn’t have capacity, we don’t support that version}
* The architecture of the application doesn’t match our { storage, network, deployment, security } model
* We weren’t consulted about the { reporting, security, monitoring, backup, provisioning } and it can’t be “productionised”.
But Operations persevere and install the new hotness – cursing and bitching throughout. Sadly, after forcing the application onto infrastructure and bending and twisting the architecture to get it running, the performance of the new application can be summed up as “epic fail”.
Operations sighs and starts logging problems and passing issues back to the Development team. Their responses generally come from the following pool:
* It’s not our fault – our code is perfect – it’s just been poorly implemented
* Operations are stupid and don’t understand the new hotness – why can’t they implement the cutting edge technology? Why are they so backward?
* It runs fine on my machine…
The interactions between teams quickly become a toxic blame storm. The customers (and by extension the shareholders, investors and management) then become the losers. The loop gets closed with the company losing bucket loads of money and everyone losing their jobs. EPIC and FAIL.
What’s different about DevOps?
DevOps is all about trying to avoid that epic failure and working smarter and more efficiently at the same time. It is a framework of ideas and principles designed to foster cooperation, learning and coordination between development and operational groups. In a DevOps environment, developers and sysadmins build relationships, processes, and tools that allow them to better interact and ultimately better service the customer.
DevOps is also more than just software deployment – it’s a whole new way of thinking about cooperation and coordination between the people who make the software and the people who run it. Areas like automation, monitoring, capacity planning & performance, backup & recovery, security, networking and provisioning can all benefit from using a DevOps model to enhance the nature and quality of interactions between development and operations teams.
Everyone in the DevOps community has a slightly different take on “What is DevOps?” We all bring different experiences and focuses to the problem space. I personally see DevOps as having four quadrants:
Simplicity
KISS is King and in that vein this section is simple too. Design simple, repeatable, and reusable solutions. Simplicity saves documentation, training, and support time. Simplicity increases the speed of communication, avoids confusion, and helps reduce the risk of development and operational errors. Simplicity gets you to the pub faster.
Relationships
Engage early, engage often. Development teams need to embed operations people into their project and development life cycles. Invite operational people to your scrum or development meetings. Share ideas and information about product plans and new technologies. Gather operational requirements when gathering functional ones. As a project progresses, test deployment, backup, monitoring, security and configuration management as well as application functionality. The more issues you fix during the project, the fewer issues you expose your customers to when the application is live. Educate operations people about the application’s architecture and the code base. The more information operations people can feed you about a problem with the code, the less trouble-shooting you need to perform and the faster the problem can be fixed.
Operations people need to bring development people into the problem and change management space. Invite developers into your team meetings. Share your roadmaps and upgrade plans. Understand where future development is heading to better ensure infrastructure deployments match product requirements. Developers also bring skills, knowledge and tools that can help make your environment easier to manage, more efficient and cleaner. Learn to code or if you’re a hack-n-slash systems programmer like me then learn to code better.
Concepts like building tools with APIs rather than closed interfaces, distributed version control, test driven development, and methodologies like Agile Development, Kanban and Scrum can revolutionise operational practices in the same way they’ve changed the way code is cut.
Don’t be afraid of ideas and approaches from outside your domain – we can all learn things, even if it’s “let’s never do it that way again…!”, from how others do things and ultimately? Guess what? Yep, we’re all on the SAME team.
Remember that interactions between people rank, in decreasing order of effectiveness (IMHO, but backed by some research):
1. Face to face
2. Video conference
3. Phone
4. IM & IRC
5. Email
Process
Don’t underestimate the power of process and automation. Many shops do process engineering – ranging from hand-written lists to ISO9001. Those processes generally have one key flaw: they focus on the outcome and its inevitability. A simple process might provision a host – Step 1 install machine, Step 2 cable machine, Step 3 install OS, etc. Assuming all goes according to process then at the end of Step x you will have a fully provisioned host. But what happens if it doesn’t go right? If your process breaks or you receive some anomalous output how does your process deal with it?
Instead think about process as a journey and map out the potential pitfalls and obstacles. Treat your processes like applications and build error handling into them. You can’t predict every application or operational pitfall or issue but you can ensure that if you hit one your process isn’t derailed.
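To make the "treat your processes like applications" idea concrete, here is a minimal sketch in Python of the host provisioning process with error handling built in. Everything in it is an assumption made for illustration: the `Step`, `StepError` and `run_process` names are hypothetical, the step actions are stand-ins, and "raise a ticket" is just one possible way of handling a failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class StepError(Exception):
    """Raised by a step that cannot complete (hypothetical name)."""


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    # Optional handler that runs if the action fails, e.g. raise a ticket.
    on_failure: Optional[Callable[[Exception], None]] = None


@dataclass
class ProcessResult:
    completed: List[str] = field(default_factory=list)
    failed: Optional[str] = None
    handled: bool = False


def run_process(steps: List[Step]) -> ProcessResult:
    """Run steps in order; on failure, call the step's handler and stop."""
    result = ProcessResult()
    for step in steps:
        try:
            step.action()
        except StepError as exc:
            result.failed = step.name
            if step.on_failure is not None:
                step.on_failure(exc)
                result.handled = True
            break  # stop in a known state instead of ploughing on
        result.completed.append(step.name)
    return result


def cable_machine() -> None:
    # Stand-in for a step that goes wrong.
    raise StepError("no free switch port")


tickets: List[str] = []
steps = [
    Step("install machine", lambda: None),
    Step("cable machine", cable_machine,
         on_failure=lambda exc: tickets.append(f"ticket raised: {exc}")),
    Step("install OS", lambda: None),
]
result = run_process(steps)
print(result)
print(tickets)
```

The point of the sketch is that the failure path is part of the process: when Step 2 breaks, the process records where it stopped, runs the handler for that step, and leaves you with something you can act on rather than a half-built host and no record of why.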
Link processes together across domains – software deployment, monitoring, capacity planning and other “operational” processes have their start in the development world. Software deployment is the logical conclusion of the software development life cycle and should be viewed as such rather than a separate operational process. Another example is metrics and monitoring: it is hard to measure anything without understanding the baselines and assumptions made in the development domain. Joint processes also mean more opportunity for development and operations interaction, understanding and joint accountability. Finally, joint process development means single repositories for documentation and other opportunities for economies of scale.
Automate, automate, automate. Build or make use of simple and extensible tools (make sure they have APIs and machine readable input and output – see James White’s Infrastructure Manifesto). Use tools like Puppet (or others) to manage your configuration. Remember to extend your automation umbrella cross-domain and end-to-end in your environment – manage development, testing, staging and production environments with the same tools and processes. Not only does this have economies of scale benefits in support and management but it means you can test deployment and management alongside functionality as your application and new code roll toward production.
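As a rough illustration of "machine readable output" and "the same tools in every environment", here is a small Python sketch of a check tool that reports in JSON and behaves identically from development through to production. The `report` function, the environment names and the check names are hypothetical, chosen only for the example; this is not how Puppet or any particular tool works.

```python
import json

# The same tool is meant to run unchanged in every environment.
ENVIRONMENTS = ("development", "testing", "staging", "production")


def report(environment: str, checks: dict) -> str:
    """Return check results as JSON so other tools can consume them."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return json.dumps(
        {
            "environment": environment,
            "ok": all(checks.values()),  # overall status for quick filtering
            "checks": checks,
        },
        sort_keys=True,
    )


print(report("staging", {"disk": True, "ntp": False}))
```

Because the output is structured rather than free text, a monitoring system or a deployment script can read the `ok` field directly instead of scraping a log line.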
Finally, when building process and automation always keep the KISS principle in mind. Complexity breeds opportunities for error. Build simple processes and tools that are easy to implement, manage and maintain.
Continuous Improvement
Don’t stop innovating and learning. Technology moves fast. So do customer requirements. Build continuous improvement and integration into your tools and processes. This is a good place for operations people to learn from (good) developers about practices like test-driven development. A good example here is to build tests for your software deployment process and infrastructure. They are often an application in their own right and should be developed and maintained correctly. Your monitoring could also be extended with behavioural testing to deliver better business value. Look at using development domain tools, like Hudson for example, to explore and measure the operational domain.
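One way to picture "tests for your deployment process" is to treat a piece of deployment tooling like any other code and give it unit tests. The sketch below uses Python's standard `unittest` and `string.Template` modules; the `render_vhost` function, the template text and the host name are all hypothetical and stand in for whatever your deployment actually generates.

```python
import unittest
from string import Template

# Hypothetical config template that a deployment step would write out.
VHOST_TEMPLATE = Template("server_name $hostname;\nlisten $port;\n")


def render_vhost(hostname: str, port: int) -> str:
    """Render the config fragment, rejecting obviously bad input early."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return VHOST_TEMPLATE.substitute(hostname=hostname, port=port)


class RenderVhostTest(unittest.TestCase):
    def test_renders_hostname_and_port(self):
        text = render_vhost("app01.example.com", 8080)
        self.assertIn("server_name app01.example.com;", text)
        self.assertIn("listen 8080;", text)

    def test_rejects_out_of_range_port(self):
        with self.assertRaises(ValueError):
            render_vhost("app01.example.com", 70000)
```

Saved as a module, tests like these can be run with `python -m unittest` and wired into the same continuous integration job that builds the application, so a broken deployment script fails the build rather than the release.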
Learn from mistakes and from outages. Seek root cause aggressively AND cross-domain. If you have an outage and a post-incident review then bring development and operational teams together to review the incident. Sometimes some simple code refactoring can save making infrastructure changes. Work together to fix the root cause and treat it with the same process you develop to conduct project to production software deployment, rather than relegating it to incident review reports or batting issues between teams.
Me
Finally, for me DevOps is about people and the nature of the environment you want to work in. The best thing about the movement for me is that it is trying to foster behaviours and environments where people work together towards joint goals rather than at cross-purposes or at odds. That’s a world I’d much rather use my skills in.