
Text-mining 40 million Stack Overflow posts: the programming books developers recommend most (with code)

Authors: xixi, 吳雙, Aileen (translation) 2018-02-26 17:11:11


What books do programmers read? Which ones do they recommend to others?

The author of this article analyzed 40 million questions and answers on Stack Overflow to find the books programmers discuss most, and generously published the data analysis code as well. Let's see what the author has to say.

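Finding your next programming book worth reading is hard, and it is risky.

As a developer, your time is precious, and reading a book takes a lot of it. You could have spent that time coding, or resting, but instead you chose to spend it reading to improve your skills.

So which book should you read? My colleagues and I often discuss books, and I have noticed that our opinions on any given book differ widely.

Fortunately, Stack Exchange (the parent company of Stack Overflow, the Q&A site programmers use most) has published its question-and-answer data. Using this data, I found the programming books discussed most across 40 million Stack Overflow questions and answers: 5,720 books in total.

In this article I describe in detail how I obtained and analyzed the data, with code.

I built dev-books.com to show the ranked book recommendations.

Let's zoom in on the most popular books.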

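"The most recommended book is Working Effectively with Legacy Code, followed by Design Patterns: Elements of Reusable Object-Oriented Software.
The titles may sound dry, but the content is of high quality. You can sort the books by number of recommendations under each tag, such as JavaScript, C, Graphics, and so on. This is obviously not the ultimate book recommendation system, but if you are about to start programming or want to improve your knowledge, it is a good place to begin."

(review from Lifehacker.com)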

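Getting and importing the data

I downloaded the Stack Exchange data dump from archive.org (https://archive.org/details/stackexchange).

From the very beginning I realized that loading a 48GB XML file into a freshly created PostgreSQL database with the usual approach (such as myxml := pg_read_file('path/to/my_file.xml')) was not possible, because my server did not have 48GB of RAM. So I decided to use a SAX parser.

All the values are stored as attributes of the row elements, and I used Python to extract them: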

def startElement(self, name, attributes):
    # Called by the SAX parser for every opening tag; each post is one row element.
    if name == 'row':
        self.cur.execute(
            "INSERT INTO posts (Id, Post_Type_Id, Parent_Id, Accepted_Answer_Id, "
            "Creation_Date, Score, View_Count, Body, Owner_User_Id, "
            "Last_Editor_User_Id, Last_Editor_Display_Name, Last_Edit_Date, "
            "Last_Activity_Date, Community_Owned_Date, Closed_Date, Title, Tags, "
            "Answer_Count, Comment_Count, Favorite_Count) "
            "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
            (
                # .get() returns None when the attribute is absent
                attributes.get('Id'),
                attributes.get('PostTypeId'),
                attributes.get('ParentID'),  # misspelled: the attribute is 'ParentId'
                attributes.get('AcceptedAnswerId'),
                attributes.get('CreationDate'),
                attributes.get('Score'),
                attributes.get('ViewCount'),
                attributes.get('Body'),
                attributes.get('OwnerUserId'),
                attributes.get('LastEditorUserId'),
                attributes.get('LastEditorDisplayName'),
                attributes.get('LastEditDate'),
                attributes.get('LastActivityDate'),
                attributes.get('CommunityOwnedDate'),
                attributes.get('ClosedDate'),
                attributes.get('Title'),
                attributes.get('Tags'),
                attributes.get('AnswerCount'),
                attributes.get('CommentCount'),
                attributes.get('FavoriteCount'),
            )
        )
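To see the streaming approach in isolation, here is a minimal self-contained sketch. The sample XML, the class name RowCollector, and the choice of three columns are assumptions made for illustration; the real dump has many more attributes, and the handler above writes to PostgreSQL rather than to a list.

```python
import xml.sax

# Tiny stand-in for the posts file; the sample rows are made up.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="First question" />
  <row Id="2" PostTypeId="2" ParentId="1" />
</posts>"""

COLUMNS = ("Id", "PostTypeId", "ParentId")


class RowCollector(xml.sax.ContentHandler):
    """Collects selected attributes of each <row> element as a tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def startElement(self, name, attributes):
        if name == "row":
            # attributes.get() returns None for a missing (or misspelled) name,
            # so a typo in a column name fails silently rather than raising.
            self.rows.append(tuple(attributes.get(col) for col in COLUMNS))


handler = RowCollector()
# parseString feeds the parser from memory; xml.sax.parse(path, handler)
# would stream a file from disk instead of loading it whole.
xml.sax.parseString(SAMPLE, handler)
print(handler.rows)
```

Because the parser only ever holds the current element, memory use stays flat regardless of file size, which is the property that makes a 48GB input workable on a small server.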

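Three days into the import (by which point nearly half of the XML had been loaded), I realized that I had made a mistake: I had written "ParentID" where the attribute is actually "ParentId".

At that point I did not want to wait another week, so I swapped the processor from an AMD E-350 (2 x 1.35GHz) to an Intel G2020 (2 x 2.90GHz), but that did not speed things up.

The next decision: batch inserts.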

class docHandler(xml.sax.ContentHandler):
    def __init__(self, cusor):
        self.cusor = cusor
        self.queue = 0
        self.output = StringIO()

    def startElement(self, name, attributes):
        if name == 'row':
            # One tab-separated line per post; '\\N' marks NULL for COPY.
            self.output.write(
                attributes['Id'] + '\t' +
                (attributes['PostTypeId'] if 'PostTypeId' in attributes else '\\N') + '\t' +
                (attributes['ParentId'] if 'ParentId' in attributes else '\\N') + '\t' +
                (attributes['AcceptedAnswerId'] if 'AcceptedAnswerId' in attributes else '\\N') + '\t' +
                (attributes['CreationDate'] if 'CreationDate' in attributes else '\\N') + '\t' +
                (attributes['Score'] if 'Score' in attributes else '\\N') + '\t' +
                (attributes['ViewCount'] if 'ViewCount' in attributes else '\\N') + '\t' +
                # Free-text fields need backslashes and control characters escaped.
                (attributes['Body'].replace('\\', '\\\\').replace('\n', '\\\n').replace('\r', '\\\r').replace('\t', '\\\t') if 'Body' in attributes else '\\N') + '\t' +
                (attributes['OwnerUserId'] if 'OwnerUserId' in attributes else '\\N') + '\t' +
                (attributes['LastEditorUserId'] if 'LastEditorUserId' in attributes else '\\N') + '\t' +
                (attributes['LastEditorDisplayName'].replace('\n', '\\n') if 'LastEditorDisplayName' in attributes else '\\N') + '\t' +
                (attributes['LastEditDate'] if 'LastEditDate' in attributes else '\\N') + '\t' +
                (attributes['LastActivityDate'] if 'LastActivityDate' in attributes else '\\N') + '\t' +
                (attributes['CommunityOwnedDate'] if 'CommunityOwnedDate' in attributes else '\\N') + '\t' +
                (attributes['ClosedDate'] if 'ClosedDate' in attributes else '\\N') + '\t' +
                (attributes['Title'].replace('\\', '\\\\').replace('\n', '\\\n').replace('\r', '\\\r').replace('\t', '\\\t') if 'Title' in attributes else '\\N') + '\t' +
                (attributes['Tags'].replace('\n', '\\n') if 'Tags' in attributes else '\\N') + '\t' +
                (attributes['AnswerCount'] if 'AnswerCount' in attributes else '\\N') + '\t' +
                (attributes['CommentCount'] if 'CommentCount' in attributes else '\\N') + '\t' +
                (attributes['FavoriteCount'] if 'FavoriteCount' in attributes else '\\N') + '\n'
            )
            self.

StringIO lets you treat a variable as a file, so you can pass it to the copy_from function, which runs the COPY command. With this approach the whole import took just one night.
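The buffering idea can be sketched on its own as follows. This is an illustration under assumptions, not the author's code: the helper names copy_escape, write_row and flush are hypothetical, the cursor is assumed to be a psycopg2 cursor (whose copy_from method matches the name used in the text), and the escaping writes the \n, \r and \t sequences recommended by the PostgreSQL documentation, which differs slightly from the listing above.

```python
from io import StringIO


def copy_escape(value):
    """Encode one field for PostgreSQL's COPY text format; None becomes \\N."""
    if value is None:
        return "\\N"
    return (
        value.replace("\\", "\\\\")  # escape backslashes first
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def write_row(buffer, fields):
    """Append one tab-separated, newline-terminated row to the buffer."""
    buffer.write("\t".join(copy_escape(f) for f in fields) + "\n")


def flush(cursor, buffer, table, columns):
    """Send the buffered rows with a single COPY and return an empty buffer."""
    buffer.seek(0)  # copy_from reads from the current position
    # Assumed psycopg2 API: cursor.copy_from(file, table, sep, null, columns)
    cursor.copy_from(buffer, table, sep="\t", null="\\N", columns=columns)
    return StringIO()


buf = StringIO()
write_row(buf, ["1", None, "line one\nline two"])
write_row(buf, ["2", "1", "tab\there"])
print(repr(buf.getvalue()))
```

In a handler, one would call write_row for each row element and flush every few thousand rows, so the database sees a handful of COPY commands instead of millions of INSERT statements.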

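OK, time to create the indexes. In theory GiST indexes are slower than GIN but take up less space, so I decided to use GiST. One more day later I had 70GB of indexed data.

After trying a few test queries, I found that they took a very long time to process. The reason was disk I/O wait. Moving to an SSD, a GOODRAM C40 120GB, helped a lot, even though it is not the fastest SSD around.

I created a new PostgreSQL cluster: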

initdb -D /media/ssd/postgres/data

Then I made sure to change the path in my service config (I was using Manjaro OS):

vim /usr/lib/systemd/system/postgresql.service
Environment=PGROOT=/media/ssd/postgres
PIDFile=/media/ssd/postgres/data/postmaster.pid

Reload the config and start PostgreSQL:

systemctl daemon-reload
systemctl start postgresql

This time the import took a few hours, but I used GIN for the index. The index took up 20GB on the SSD, and simple queries took less than a minute.

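Extracting books from the database

With the data finally imported, I started looking for posts that mention books and copied them to a separate table with SQL:

CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE '%book%';

The next step was to find the posts that contain links:

CREATE TABLE http_books AS SELECT * FROM posts WHERE body LIKE '%http%';

At this point I realized that Stack Overflow proxies all such links in the following form:

rads.stackoverflow.com/[$isbn]/

So I created another table with the posts containing these links:

CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE '%http://rads.stackoverflow.com%';

I used regular expressions to extract all the ISBNs (International Standard Book Numbers), and extracted the Stack Overflow tags into another table with: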

regexp_split_to_table
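The same two extraction steps can be sketched in Python. Everything here is an assumption for illustration: the text only shows the link form rads.stackoverflow.com/[$isbn]/, so the pattern treats a 10-character ISBN as the last path segment and tolerates extra segments before it; tags are assumed to be stored as "<tag1><tag2>"; and the function names and sample strings are hypothetical.

```python
import re

# Assumption: a 10-character ISBN (digits, last one may be X) is the final
# path segment of a rads.stackoverflow.com link.
ISBN_LINK = re.compile(
    r"https?://rads\.stackoverflow\.com/(?:[^\s\"'<>/]+/)*(\d{9}[\dXx])(?!\w)"
)
# Assumption: tags are stored in the form "<tag1><tag2>".
TAG = re.compile(r"<([^<>]+)>")


def extract_isbns(body):
    """Return every ISBN found in proxied book links, in order of appearance."""
    return [isbn.upper() for isbn in ISBN_LINK.findall(body)]


def split_tags(tags):
    """Split a "<a><b>" tag string into a list of tag names."""
    return TAG.findall(tags or "")


sample_body = (
    'See <a href="http://rads.stackoverflow.com/amzn/click/0131177052">this'
    "</a> and http://rads.stackoverflow.com/0201633612/ too."
)
print(extract_isbns(sample_body))
print(split_tags("<c#><.net><design-patterns>"))
```

Pairing each extracted ISBN with each tag of the same post gives the rows of an ISBN-to-tag table that can then be counted per tag.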

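Once I had the most popular tags extracted and counted, I found that the top 20 most-mentioned books for different tags were quite similar.

My next step: refining the tags.

The idea was to take the top 20 most-mentioned books for each tag and exclude the books that had already been processed. Since this was a one-off job, I decided to use PostgreSQL arrays. The query looks like this: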

SELECT *
     , ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude))
     , ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1)
FROM (
    SELECT *
         , ARRAY['isbn1', 'isbn2', 'isbn3'] AS to_exclude
    FROM (
        SELECT
            tag
          , ARRAY_AGG(DISTINCT isbn) AS isbns
          , COUNT(DISTINCT isbn)
        FROM (
            SELECT *
            FROM (
                SELECT
                    it.*
                  , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag
                    FROM tags
                    ORDER BY popularity DESC
                    LIMIT 1 OFFSET 0
                )
                ORDER BY post_count DESC LIMIT 20
            ) AS t1
            UNION ALL
            SELECT *
            FROM (
                SELECT
                    it.*
                  , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag
                    FROM tags
                    ORDER BY popularity DESC
                    LIMIT 1 OFFSET 1
                )
                ORDER BY post_count DESC
                LIMIT 20
            ) AS t2
            UNION ALL
            SELECT *
            FROM (
                SELECT
                    it.*
                  , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag
                    FROM tags
                    ORDER BY popularity DESC
                    LIMIT 1 OFFSET 2
                )
                ORDER BY post_count DESC
                LIMIT 20
            ) AS t3
            ...
            UNION ALL
            SELECT *
            FROM (
                SELECT
                    it.*
                  , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag
                    FROM tags
                    ORDER BY popularity DESC
                    LIMIT 1 OFFSET 78
                )
                ORDER BY post_count DESC
                LIMIT 20
            ) AS t79
        ) AS tt
        GROUP BY tag
        ORDER BY max(popularity) DESC
    ) AS ttt
) AS tttt
ORDER BY ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1) DESC;
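The exclusion step is a set difference, which may be easier to follow outside SQL. The sketch below is one possible reading of it and not the author's query: it walks the tags from most to least popular and keeps, for each tag, only the top books that no earlier tag has already claimed. The function name and the sample data are hypothetical.

```python
def distinct_top_books(top_by_tag, limit=20):
    """For tags in popularity order, keep each tag's top books not yet claimed.

    top_by_tag: list of (tag, [isbn, ...]) pairs, most popular tag first,
    with each ISBN list already sorted by mention count.
    """
    seen = set()
    result = {}
    for tag, isbns in top_by_tag:
        top = isbns[:limit]
        # Equivalent of: SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)
        result[tag] = [isbn for isbn in top if isbn not in seen]
        seen.update(top)
    return result


ranked = [
    ("javascript", ["isbn1", "isbn2", "isbn3"]),
    ("java", ["isbn2", "isbn4"]),
    ("c#", ["isbn1", "isbn4", "isbn5"]),
]
print(distinct_top_books(ranked))
```

Unlike SQL's EXCEPT, the list comprehension keeps the original ranking order of the remaining books.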

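Now that I had the data I needed, I started building the website.

Building the website

Since I am not a web developer, and certainly not a web UI expert, I decided to create a very simple single-page app based on a default theme.

I created a "search by tag" option and then extracted the most popular tags, so that each search can be run by clicking the corresponding tag.

I visualized the search results with a bar chart. I tried Highcharts and D3 (both JavaScript data visualization libraries), but they are geared mainly toward display, had some problems with responsiveness, and were complicated to configure. So I decided to build my own responsive chart with SVG. To make the chart responsive, it has to be redrawn whenever the screen orientation changes.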

var w = $('#plot').width();
var bars = "";
var imgs = "";
var texts = "";
var rx = 10;
var tx = 25;
var max = Math.floor(w / 60);  // one 60px slot per bar
var maxPop = 0;
var obj, h, dt;

// First pass: find the largest count among the visible books.
for (var i = 0; i < max; i++) {
    if (i > books.length - 1) {
        break;
    }
    obj = books[i];
    if (maxPop < Number(obj.pop)) {
        maxPop = Number(obj.pop);
    }
}

// Second pass: build the SVG markup for each bar, cover image and label.
for (var i = 0; i < max; i++) {
    if (i > books.length - 1) {
        break;
    }
    obj = books[i];
    h = Math.floor((180 / maxPop) * obj.pop);  // scale to a 180px tall chart
    dt = 0;  // horizontal nudge to center the label over the bar

    if (('' + obj.pop + '').length == 1) {
        dt = 5;
    }

    if (('' + obj.pop + '').length == 3) {
        dt = -3;
    }

    var scrollTo = 'onclick="scrollTo(\'' + obj.id + '\'); return false;"';
    bars += '<rect id="rect' + obj.id + '" class="cla" x="' + rx + '" y="' + (180 - h + 30) + '" width="50" height="' + h + '" ' + scrollTo + '>';
    bars += '<title>' + obj.name + '</title>';
    bars += '</rect>';

    imgs += '<image height="70" x="' + rx + '" y="220" href="img/ol/jpeg/' + obj.id + '.jpeg" onmouseout="unhoverbar(' + obj.id + ');" onmouseover="hoverbar(' + obj.id + ');" width="50" ' + scrollTo + '>';
    imgs += '<title>' + obj.name + '</title>';
    imgs += '</image>';

    texts += '<text x="' + (tx + dt) + '" y="' + (180 - h + 20) + '" class="bar-label" style="font-size: 16px;" ' + scrollTo + '>' + obj.pop + '</text>';
    rx += 60;
    tx += 60;
}

$('#plot').html(
    ' <svg width="100%" height="300" aria-labelledby="title desc" role="img">'
    + ' <defs> '
    + ' <style type="text/css"><![CDATA['
    + ' .cla {'
    + ' fill: #337ab7;'
    + ' }'
    + ' .cla:hover {'
    + ' fill: #5bc0de;'
    + ' }'
    + ' ]]></style>'
    + ' </defs>'
    + ' <g class="bar">'
    + bars
    + ' </g>'
    + ' <g class="bar-images">'
    + imgs
    + ' </g>'
    + ' <g class="bar-text">'
    + texts
    + ' </g>'
    + '</svg>');
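The layout arithmetic in the script above (how many bars fit, and how tall each one is) can be restated on its own. This is a Python sketch of that arithmetic only, not a replacement for the JavaScript; the function name bar_layout and its parameter names are hypothetical, while the constants 60, 180, 10 and 30 are taken from the listing.

```python
import math


def bar_layout(width, pops, slot=60, max_height=180, left=10, top_pad=30):
    """Return an (x, y, height) tuple for every bar that fits in `width` pixels."""
    count = min(int(width // slot), len(pops))  # one bar per 60px slot
    visible = pops[:count]
    max_pop = max(visible, default=0)
    bars = []
    for i, pop in enumerate(visible):
        # Scale so the most mentioned visible book fills the full chart height.
        h = math.floor((max_height / max_pop) * pop) if max_pop else 0
        bars.append((left + i * slot, max_height - h + top_pad, h))
    return bars


# A 200px wide container has room for three of the four sample bars.
print(bar_layout(200, [90, 45, 30, 10]))
```

Calling the function again with the new container width after an orientation change yields a fresh layout, which is why the chart is simply redrawn from scratch.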

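Web server failure

Nginx or Apache?

When I published dev-books.com, it got a huge number of hits. Apache could not serve more than 500 visitors at the same time, so I quickly set up Nginx and switched the web server over. Honestly, I was pleasantly surprised to see 800 visitors on the site at the same time!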

Editor: 未麗燕 Source: Big Data Digest (大數據文摘)
