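Using ChatGPT to Help Process HTML Data

Original

Author: Chen Xiaobing 2023-05-30 18:05:00

To program with ChatGPT's help, first describe your requirements clearly, then debug and refine the generated code step by step until the task is done. For programmers, having such a tool is both a good thing and a bad thing.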

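I. Confirm the Requirements

Please write a program that does the following:

1. All HTML files are stored in the all folder.

2. Process every HTML file in the all folder and extract the data from its table.

3. Extract the table header automatically, with column names separated by commas.

4. Write the processed result to a file named out plus a timestamp, so that output from runs at different times can be told apart.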
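Requirement 4 can be sketched in a few lines. The helper name `timestamped_name` and its parameters are assumptions made for illustration; only the `out_` prefix and the `%Y-%m-%d_%H%M%S` pattern come from the code in this article.

```python
from datetime import datetime

def timestamped_name(prefix="out", ext="csv", now=None):
    """Build a file name such as out_2023-05-30_180500.csv (hypothetical helper)."""
    now = now or datetime.now()  # a fixed time can be passed in for testing
    return f"{prefix}_{now.strftime('%Y-%m-%d_%H%M%S')}.{ext}"

# A fixed datetime makes the result reproducible.
print(timestamped_name(now=datetime(2023, 5, 30, 18, 5, 0)))
# prints: out_2023-05-30_180500.csv
```

Because the pattern has one-second resolution, two runs started within the same second would produce the same name.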

II. Programming with ChatGPT

The source code is as follows:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup

def extract_data(file):
    # Parse one HTML file and append its table rows to the shared lists.
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    table = soup.find('table')
    if table is None:  # skip files that contain no table
        return
    if not headers:
        # Take the column names from the first table that is processed.
        ths = table.find_all('th')
        headers.extend([th.text.strip().replace('\n', '') for th in ths])
    for tr in table.find_all('tr')[1:]:  # skip the header row
        tds = tr.find_all(['th', 'td'])
        row = []
        for td in tds:
            text = td.text.strip().replace('\n', '').replace(',', ';')
            row.append(text)
        # Keep only rows whose cell count matches the header.
        if len(row) == len(headers):
            rows.append(row)

def write_data(output_file):
    # Write the header and all collected rows to a CSV file.
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []
for file in html_files:
    extract_data(file)
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
write_data(output_file)
print(f'Data has been saved to {output_file}')

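ChatGPT's explanation:

Specifically, we use the BeautifulSoup library to parse the HTML files and extract the data by locating the <table> tag and the <th> and <td> tags inside it. A headers list holds the table header, and a rows list holds each row of data.

For each HTML file, we first look for the table header and add it to the headers list. We then walk through every row of the table and add its data to the rows list. For each row, we check whether its length equals that of the headers list; if it does not, the row is treated as malformed and is not saved. Finally, all the data is written to a single CSV file named out plus the current time, so that output from runs at different times can be told apart.

I believe this code meets your needs. If you have any questions, feel free to contact me.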
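The row-length check and the CSV output described above can be tried in isolation. This is a minimal sketch using only the standard library; the function name `rows_to_csv` and the sample data are assumptions, not part of the generated code. It also shows that `csv.writer` quotes a field containing a comma, so replacing commas with semicolons, as the generated code does, is optional rather than required.

```python
import csv
import io

def rows_to_csv(headers, rows):
    """Write headers plus the rows whose length matches them (hypothetical helper).

    Returns the CSV text and the number of rows that were skipped.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(headers)
    kept = [r for r in rows if len(r) == len(headers)]
    writer.writerows(kept)
    return buf.getvalue(), len(rows) - len(kept)

text, skipped = rows_to_csv(
    ["name", "note"],
    [["a", "x, y"], ["b"], ["c", "z"]],  # the second row is one cell short
)
print(text.splitlines())
print(skipped)
```

Reading the text back with `csv.reader` restores `x, y` as a single field.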

III. Further Optimization

1. Improve efficiency.

2. Show a progress bar.

3. Save data promptly so that unexpected failures do not lose work.

The final code is as follows:

import os
import csv
import threading
from datetime import datetime
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

# headers, rows and files are shared by all worker threads, so guard them with a lock.
lock = threading.Lock()

def extract_data(file):
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    table = soup.find('table')
    if table is None:  # skip files that contain no table
        return
    with lock:
        if not headers:
            ths = table.find_all('th')
            headers.extend([th.text.strip().replace('\n', '') for th in ths])
        for tr in table.find_all('tr')[1:]:  # skip the header row
            tds = tr.find_all(['th', 'td'])
            row = []
            for td in tds:
                text = td.text.strip().replace('\n', '').replace(',', ';')
                row.append(text)
            if len(row) == len(headers):
                rows.append(row)
                # After every 1000 rows, save the current data to a temporary file.
                if len(rows) % 1000 == 0:
                    output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}_{len(files)}.csv"
                    write_data(output_file)
                    files.append(output_file)
                    rows.clear()

# Merge all temporary CSV files into one file.
def combine_files(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for file in files:
            with open(file, 'r', encoding='utf-8', newline='') as f2:
                reader = csv.reader(f2)
                next(reader)  # skip the header row
                for row in reader:
                    writer.writerow(row)
            os.remove(file)  # delete the temporary file
    print('All files have been combined into one CSV file.')

def write_data(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []
files = []
with ThreadPoolExecutor(max_workers=4) as executor:
    # Wrap the whole map in tqdm to show overall progress across files.
    for _ in tqdm(executor.map(extract_data, html_files), total=len(html_files), desc='Extracting data'):
        pass
# Save any remaining rows to a CSV file.
if rows:
    output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}_{len(files)}.csv"
    write_data(output_file)
    files.append(output_file)
# Merge all CSV files into one file.
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
combine_files(output_file)
print(f'Data has been saved to {output_file}')
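One alternative to sharing the `headers`, `rows` and `files` lists across worker threads is to have each worker return its result and let the main thread do all the writing. The sketch below illustrates that pattern with the standard library only. `parse_file` is a hypothetical stand-in for the BeautifulSoup parsing step, and `run` and its parameters are likewise assumptions, not the article's code.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def parse_file(path):
    """Hypothetical stand-in for HTML parsing: returns (headers, rows) for one file."""
    return ["id", "source"], [[str(i), path] for i in range(3)]

def run(paths, out_path, max_workers=4):
    """Parse files in worker threads; write rows from the main thread only."""
    total = 0
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        header_written = False
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # executor.map yields results in the same order as the input paths.
            for headers, rows in executor.map(parse_file, paths):
                if not header_written:
                    writer.writerow(headers)
                    header_written = True
                writer.writerows(rows)
                f.flush()  # hand each file's rows to the OS as soon as they arrive
                total += len(rows)
    return total
```

Since only the main thread touches the output file, no lock is needed, and flushing after each input file takes the place of the temporary files and the merge step. The loop over `executor.map` can still be wrapped in `tqdm` for a progress bar.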

Execution result: [screenshot not preserved]

Editor: Pang Guiyu    Source: 51CTO
