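Comparing YOLO, SSD, and Faster R-CNN for Object Detection
This article sets out to build a computer vision system that accurately detects and segments objects in video. I use three state-of-the-art (SoA) methods, YOLO, SSD, and Faster R-CNN, and evaluate their performance. I then inspect the results visually to highlight each method's strengths and weaknesses, and use the evaluation and analysis to identify the best-performing method. A link at the end shows the best method running on video.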
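1. YOLO (You Only Look Once)
Deep learning models such as YOLOv8 have become essential in industries including robotics, autonomous driving, and video surveillance. These models detect objects in real time and influence safety and decision-making processes. YOLOv8 (You Only Look Once) uses computer vision techniques and machine learning algorithms to identify objects in images and video with high speed and accuracy. This enables efficient, accurate object detection, which many applications depend on (Keylabs, 2023).
Implementation details
I created a run_model function to implement object detection and segmentation. It takes three arguments: the model, the input video, and the output video. It reads the video frame by frame and draws the model's results onto each frame. The annotated frames are then saved to the output video file until every frame has been processed or the user presses "q" to stop.
For object detection I use the YOLO model (yolov8n.pt, "v8"), which produces a video with the detected bounding boxes drawn in. Likewise, for object segmentation, the YOLO model with segmentation-specific weights (yolov8n-seg.pt) produces a video with the segmented objects.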
import cv2
from ultralytics import YOLO

# VIDEO and the OUTPUT_VIDEO_* constants are file paths defined elsewhere in the project.

def run_model(model, video, output_video):
    """Run the model on every frame of `video` and save the annotated frames."""
    cap = cv2.VideoCapture(video)
    if not cap.isOpened():
        print("Cannot open video")
        return
    # Get frame width and height
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Create a VideoWriter object (output is written at a fixed 20 fps)
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video, fourcc, 20.0, (frame_width, frame_height))
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()
        if not ret:
            print("No frame...")
            break
        # Track objects in the frame; persist=True keeps track IDs across frames
        results = model.track(source=frame, persist=True, tracker='bytetrack.yaml')
        # Draw the results (boxes, plus masks for the -seg weights) onto the frame
        frame = results[0].plot()
        # Write the frame to the output video file
        out.write(frame)
        # Display the resulting frame
        cv2.imshow("ObjectDetection", frame)
        # Terminate run when "q" is pressed
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    # Release the capture and the writer so the output file is finalized
    cap.release()
    out.release()
    cv2.destroyAllWindows()

# The second argument ("v8") is kept from the original; newer ultralytics
# releases may expect a task name in that position, in which case drop it.
# Object Detection
run_model(model=YOLO('yolov8n.pt', "v8"), video=VIDEO, output_video=OUTPUT_VIDEO_YOLO_DET)
# Object Segmentation
run_model(model=YOLO('yolov8n-seg.pt', "v8"), video=VIDEO, output_video=OUTPUT_VIDEO_YOLO_SEG)
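One detail of run_model worth checking is the fixed 20.0 passed to cv2.VideoWriter: the writer stamps frames at that rate regardless of how the source was recorded. The sketch below only works through the arithmetic. The helper names, the 600-frame clip, and the 30 fps source rate are illustrative assumptions, not values from the project.

```python
def playback_duration(num_frames: int, writer_fps: float) -> float:
    """Seconds of playback when num_frames are written at writer_fps."""
    if writer_fps <= 0:
        raise ValueError("writer_fps must be positive")
    return num_frames / writer_fps


def speed_factor(source_fps: float, writer_fps: float) -> float:
    """Playback speed relative to the source (1.0 means unchanged)."""
    if source_fps <= 0 or writer_fps <= 0:
        raise ValueError("frame rates must be positive")
    return writer_fps / source_fps


# Hypothetical clip: 600 frames recorded at 30 fps, i.e. 20 s of footage.
frames, source_fps = 600, 30.0
print(playback_duration(frames, source_fps))      # 20.0
print(playback_duration(frames, 20.0))            # 30.0
print(round(speed_factor(source_fps, 20.0), 3))   # 0.667
```

If the source is not recorded at 20 fps, reading the rate with cap.get(cv2.CAP_PROP_FPS) and passing it to the writer should keep the output at the original speed.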
2. Faster R-CNN (Region-based Convolutional Neural Network)
Faster R-CNN is a state-of-the-art object detection model with two main components: a deep, fully convolutional region proposal network (RPN) and a Fast R-CNN object detector. The RPN shares full-image convolutional features with the detection network (Ren et al., 2015) and generates high-quality region proposals, which Fast R-CNN then uses for detection. The two are merged into a single network in which the RPN tells the detector where to look (Ren et al., 2015).
(1) Object detection with Faster R-CNN
To implement object detection, I created two functions: get_model and faster_rcnn_object_detection. get_model loads a pre-trained Faster R-CNN model from the torchvision library, with a ResNet-50-FPN backbone and pre-trained on the COCO dataset, and sets it to evaluation mode. faster_rcnn_object_detection then runs detection on a single video frame and draws bounding boxes around the detected objects. It converts the frame to a tensor and passes it to the model, which returns predictions containing the bounding boxes, labels, and scores of the detected objects. For each detection with a confidence score above 0.9, it draws the bounding box together with a label showing the class and the score.
import cv2
import torch
import torchvision.transforms as T
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

# VIDEO, OUTPUT_VIDEO_FASTER_RCNN_DET, COCO_NAMES and FONT are defined elsewhere in the project.

def get_model():
    # Load a Faster R-CNN model pre-trained on COCO
    weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    # `weights` replaces the deprecated `pretrained=True` flag, so only one is passed
    model = fasterrcnn_resnet50_fpn(weights=weights)
    model.eval()
    return model

def faster_rcnn_object_detection(model, frame):
    # OpenCV frames are BGR; the torchvision model expects RGB
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Transform frame to tensor and add batch dimension
    transform = T.Compose([T.ToTensor()])
    frame_tensor = transform(rgb).unsqueeze(0)
    with torch.no_grad():
        prediction = model(frame_tensor)
    bboxes, labels, scores = prediction[0]["boxes"], prediction[0]["labels"], prediction[0]["scores"]
    # Draw boxes and labels on the frame
    for i in range(len(bboxes)):
        if scores[i] <= 0.9:  # Only draw boxes for confident predictions
            continue
        xmin, ymin, xmax, ymax = map(int, bboxes[i].tolist())
        class_name = COCO_NAMES[labels[i].item() - 1]
        cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 3)
        # Put label
        label = f"{class_name}: {scores[i]:.2f}"
        cv2.putText(frame, label, (xmin, ymin - 10), FONT, 0.5, (255, 0, 0), 2, cv2.LINE_AA)
    return frame

# Set up the model
model = get_model()
# Video capture setup
cap = cv2.VideoCapture(VIDEO)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
# Get frame width and height
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(OUTPUT_VIDEO_FASTER_RCNN_DET, fourcc, 20.0, (frame_width, frame_height))
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        print("No frame...")
        break
    # Process frame
    processed_frame = faster_rcnn_object_detection(model, frame)
    # Write the processed frame to output
    out.write(processed_frame)
    # Display the frame
    cv2.imshow('Frame', processed_frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
# Release everything once finished
cap.release()
out.release()
cv2.destroyAllWindows()
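The confidence filter in the detection function above can be looked at in isolation. The following is a minimal sketch using plain Python lists in place of the model's tensors; filter_detections and the sample boxes, labels, and scores are hypothetical and only illustrate the "score above 0.9" rule.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def filter_detections(
    boxes: List[Box],
    labels: List[int],
    scores: List[float],
    threshold: float = 0.9,
) -> List[Tuple[Box, int, float]]:
    """Keep (box, label, score) triples whose score is strictly above threshold."""
    if not (len(boxes) == len(labels) == len(scores)):
        raise ValueError("boxes, labels and scores must have the same length")
    return [
        (box, label, score)
        for box, label, score in zip(boxes, labels, scores)
        if score > threshold
    ]


# Made-up predictions for one frame.
boxes = [(10, 20, 110, 220), (5, 5, 50, 50), (0, 0, 30, 30)]
labels = [1, 3, 18]
scores = [0.97, 0.42, 0.91]

kept = filter_detections(boxes, labels, scores)
print(len(kept))  # 2
```

A high threshold such as 0.9 trades recall for cleaner frames: fewer false boxes are drawn, but less confident true detections are dropped as well.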
(2) Object segmentation with Faster R-CNN
To implement object segmentation, I load a pre-trained Mask R-CNN model, which extends Faster R-CNN with a mask-prediction branch, and wrote a function that preprocesses each video frame, applies segmentation, and overlays the masks on the frame. The model comes from the torchvision library, has a ResNet-50-FPN backbone, was pre-trained on the COCO dataset, and is set to evaluation mode. The faster_rcnn_object_segmentation function converts each frame to a tensor, runs the model on it, filters the detections by a confidence threshold, and then, for each remaining detection, overlays the mask, draws the bounding box, and adds a text label.
import cv2
import numpy as np
import torch
import torchvision.transforms as T
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

# VIDEO, OUTPUT_VIDEO_FASTER_RCNN_SEG, COCO_NAMES and FONT are defined elsewhere in the project.

# Load the pre-trained Mask R-CNN model (`weights` replaces the deprecated `pretrained=True`)
model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

# Segment one frame, then overlay masks and draw rectangles and labels on it
def faster_rcnn_object_segmentation(frame, threshold=0.9):
    # Preprocess the frame: BGR (OpenCV) to RGB, then to a batched tensor
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    transform = T.Compose([T.ToTensor()])
    frame_tensor = transform(rgb).unsqueeze(0)
    with torch.no_grad():
        predictions = model(frame_tensor)
    labels = predictions[0]['labels'].cpu().numpy()
    masks = predictions[0]['masks'].cpu().numpy()
    scores = predictions[0]['scores'].cpu().numpy()
    boxes = predictions[0]['boxes'].cpu().numpy()
    overlay = frame.copy()
    for i in range(len(masks)):
        if scores[i] > threshold:
            # Binarize the soft mask (shape [1, H, W]) at 0.6
            mask = (masks[i, 0] > 0.6).astype(np.uint8)
            # Blend a random color 50/50 into the masked pixels
            color = np.random.randint(0, 255, (3,), dtype=np.uint8)
            overlay[mask == 1] = (frame[mask == 1] * 0.5 + color * 0.5).astype(np.uint8)
            xmin, ymin, xmax, ymax = map(int, boxes[i])
            class_name = COCO_NAMES[labels[i] - 1]
            # Draw rectangle
            cv2.rectangle(overlay, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
            # Put label
            label = f"{class_name}: {scores[i]:.2f}"
            cv2.putText(overlay, label, (xmin, ymin - 10), FONT, 0.5, (255, 0, 0), 2, cv2.LINE_AA)
    return overlay

# Capture video
cap = cv2.VideoCapture(VIDEO)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
# Get frame width and height
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(OUTPUT_VIDEO_FASTER_RCNN_SEG, fourcc, 20.0, (frame_width, frame_height))
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        print("No frame...")
        break
    # Overlay masks
    processed_frame = faster_rcnn_object_segmentation(frame)
    # Write the processed frame to output
    out.write(processed_frame)
    # Display the frame
    cv2.imshow('Frame', processed_frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
# Release everything once finished
cap.release()
out.release()
cv2.destroyAllWindows()
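The mask overlay step above does two things: it binarizes the model's soft mask at 0.6, then blends a color 50/50 into the pixels the mask covers. Here is a minimal sketch of that logic on nested Python lists instead of NumPy arrays; the helper names and the tiny 1x2 "image" are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def binarize_mask(soft_mask: List[List[float]], cutoff: float = 0.6) -> List[List[int]]:
    """Turn per-pixel mask probabilities into 0/1 values."""
    return [[1 if p > cutoff else 0 for p in row] for row in soft_mask]


def blend_pixel(pixel: Pixel, color: Pixel, alpha: float = 0.5) -> Pixel:
    """Mix `color` into `pixel`; alpha is the weight given to the color."""
    return tuple(int(c * (1 - alpha) + k * alpha) for c, k in zip(pixel, color))


def overlay_mask(image, soft_mask, color, cutoff=0.6, alpha=0.5):
    """Return a new image with `color` blended into the masked pixels only."""
    binary = binarize_mask(soft_mask, cutoff)
    return [
        [blend_pixel(px, color, alpha) if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, binary)
    ]


# Tiny made-up example: the first pixel is inside the mask, the second is not.
image = [[(100, 100, 100), (100, 100, 100)]]
soft_mask = [[0.9, 0.2]]
result = overlay_mask(image, soft_mask, color=(0, 200, 0))
print(result[0][0])  # (50, 150, 50)
print(result[0][1])  # (100, 100, 100)
```

Raising the cutoff shrinks each mask toward the pixels the model is most sure about; lowering it makes masks larger but fuzzier at the edges.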
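3. SSD (Single Shot MultiBox Detector)
SSD, the Single Shot MultiBox Detector, detects objects in images using a single deep neural network. It discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios and scales at each feature map location. At prediction time, the network generates a score for the presence of each object category in each default box and adjusts the box to better match the object's shape. SSD combines predictions from multiple feature maps of different resolutions to handle objects of various sizes, and it eliminates the proposal generation and resampling stages, which makes it simple to train and to integrate into detection systems (Liu et al., 2016).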
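To make the idea of default boxes concrete, here is a minimal sketch that tiles a square feature map with boxes of several aspect ratios, using the width and height rule from Liu et al. (2016): w = s * sqrt(ar) and h = s / sqrt(ar). The function name, the feature map sizes, and the scales are illustrative assumptions and do not match the configuration of any particular SSD model.

```python
import math
from typing import List, Sequence, Tuple

DefaultBox = Tuple[float, float, float, float]  # (cx, cy, w, h), relative to image size


def default_boxes(fmap_size: int, scale: float, aspect_ratios: Sequence[float]) -> List[DefaultBox]:
    """Default boxes for every cell of a fmap_size x fmap_size feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Box centers sit in the middle of each feature map cell.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)  # wider for ar > 1
                h = scale / math.sqrt(ar)  # taller for ar < 1
                boxes.append((cx, cy, w, h))
    return boxes


# A coarse map with large boxes and a finer map with small boxes (made-up values).
coarse = default_boxes(fmap_size=2, scale=0.6, aspect_ratios=(1.0, 2.0, 0.5))
fine = default_boxes(fmap_size=4, scale=0.3, aspect_ratios=(1.0, 2.0, 0.5))
print(len(coarse), len(fine))  # 12 48
print(coarse[0])               # (0.25, 0.25, 0.6, 0.6)
```

Coarse feature maps with large scales cover big objects, while fine maps with small scales cover small ones, which is how a single forward pass handles objects of different sizes.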
(1) Object detection with SSD
I created an ssd_object_detection function that uses a pre-trained SSD model to process each video frame, run detection, and draw bounding boxes around the detected objects.
import cv2
import torch
import torchvision.transforms as T
from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

# VIDEO, OUTPUT_VIDEO_SSD_DET, COCO_NAMES and FONT are defined elsewhere in the project.

# Load the pre-trained SSD model (`weights` replaces the deprecated `pretrained=True`)
model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT)
model.eval()

def ssd_object_detection(frame, threshold=0.5):
    # Preprocess the frame: BGR (OpenCV) to RGB, then to a batched tensor
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    transform = T.Compose([T.ToTensor()])
    frame_tensor = transform(rgb).unsqueeze(0)
    with torch.no_grad():
        predictions = model(frame_tensor)
    labels = predictions[0]['labels'].cpu().numpy()
    scores = predictions[0]['scores'].cpu().numpy()
    boxes = predictions[0]['boxes'].cpu().numpy()
    for i in range(len(boxes)):
        if scores[i] > threshold:
            xmin, ymin, xmax, ymax = map(int, boxes[i])
            class_name = COCO_NAMES[labels[i] - 1]
            # Draw rectangle
            cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
            # Put label
            label = f"{class_name}: {scores[i]:.2f}"
            cv2.putText(frame, label, (xmin, ymin - 10), FONT, 0.5, (255, 0, 0), 2, cv2.LINE_AA)
    return frame

# Capture video
cap = cv2.VideoCapture(VIDEO)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
# Get frame width and height
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(OUTPUT_VIDEO_SSD_DET, fourcc, 20.0, (frame_width, frame_height))
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        print("No frame...")
        break
    # Detect objects and draw boxes
    processed_frame = ssd_object_detection(frame)
    # Write the processed frame to output
    out.write(processed_frame)
    # Display the frame
    cv2.imshow('Frame', processed_frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
# Release everything once finished
cap.release()
out.release()
cv2.destroyAllWindows()
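(2) Object segmentation with SSD
Likewise, I created an ssd_object_segmentation function that loads the pre-trained model, processes each video frame, and draws masks and labels on the detected objects. SSD itself predicts only bounding boxes, so the function approximates a mask inside each box by Otsu-thresholding the grayscale crop and filling the resulting contours.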
import cv2
import torch
import torchvision.transforms as T
from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

# VIDEO, OUTPUT_VIDEO_SSD_SEG, COCO_NAMES and FONT are defined elsewhere in the project.

# Load the pre-trained SSD model (`weights` replaces the deprecated `pretrained=True`)
model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT)
model.eval()

def ssd_object_segmentation(frame, threshold=0.5):
    # Preprocess the frame: BGR (OpenCV) to RGB, then to a batched tensor
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    transform = T.Compose([T.ToTensor()])
    frame_tensor = transform(rgb).unsqueeze(0)
    with torch.no_grad():
        predictions = model(frame_tensor)
    labels = predictions[0]['labels'].cpu().numpy()
    scores = predictions[0]['scores'].cpu().numpy()
    boxes = predictions[0]['boxes'].cpu().numpy()
    frame_h, frame_w = frame.shape[:2]
    for i in range(len(boxes)):
        if scores[i] > threshold:
            xmin, ymin, xmax, ymax = map(int, boxes[i])
            # Clip the box to the frame and skip empty crops
            xmin, ymin = max(xmin, 0), max(ymin, 0)
            xmax, ymax = min(xmax, frame_w), min(ymax, frame_h)
            if xmax <= xmin or ymax <= ymin:
                continue
            class_name = COCO_NAMES[labels[i] - 1]
            # Extract the detected object from the frame
            object_segment = frame[ymin:ymax, xmin:xmax]
            # Convert to grayscale and threshold to create a mask
            # (with THRESH_OTSU the threshold is computed automatically, so 0 is a placeholder)
            gray = cv2.cvtColor(object_segment, cv2.COLOR_BGR2GRAY)
            _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
            # Find contours
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            # Draw the filled contours on the original frame (the slice is a view into it)
            cv2.drawContours(frame[ymin:ymax, xmin:xmax], contours, -1, (0, 255, 0), thickness=cv2.FILLED)
            # Put label above the box
            label = f"{class_name}: {scores[i]:.2f}"
            cv2.putText(frame, label, (xmin, ymin - 10), FONT, 0.5, (255, 0, 0), 2, cv2.LINE_AA)
    return frame

# Capture video
cap = cv2.VideoCapture(VIDEO)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
# Get frame width and height
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(OUTPUT_VIDEO_SSD_SEG, fourcc, 20.0, (frame_width, frame_height))
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        print("No frame...")
        break
    # Overlay segmentation masks
    processed_frame = ssd_object_segmentation(frame)
    # Write the processed frame to output
    out.write(processed_frame)
    # Display the frame
    cv2.imshow('Frame', processed_frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
# Release everything once finished
cap.release()
out.release()
cv2.destroyAllWindows()
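The segmentation function above leans on Otsu's method, which picks the gray level that best separates a crop into two groups. A minimal pure-Python sketch of the idea follows; it is a simplified stand-in for what cv2.threshold does with THRESH_OTSU, and the function names and sample pixel values are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level t (0..255) that maximizes between-class variance,
    where values <= t form one class and values > t form the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low, sum_low = 0, 0
    for t in range(256):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0:
            continue  # no pixels at or below t yet
        if count_high == 0:
            break     # every pixel is already in the low class
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        var_between = count_low * count_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def inverted_binary(pixels: List[int], t: int) -> List[int]:
    """Like THRESH_BINARY_INV: values <= t become 255, the rest become 0."""
    return [255 if p <= t else 0 for p in pixels]


# Made-up crop with a dark group and a bright group of pixels.
crop = [10, 12, 11, 200, 205, 198]
t = otsu_threshold(crop)
print(t)                         # 12
print(inverted_binary(crop, t))  # [255, 255, 255, 0, 0, 0]
```

Because the split depends only on brightness, this kind of mask follows the object's outline only when the object is clearly darker (or lighter) than its background inside the box, which is why it is an approximation rather than true instance segmentation.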
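4. Evaluation
In this section I evaluate and compare the three popular object detection models: YOLO (You Only Look Once), Faster R-CNN (Region-based Convolutional Neural Network), and SSD (Single Shot MultiBox Detector). All runs were on a CPU rather than a CUDA device. The evaluation covers:
- Frames per second (FPS): the number of frames each model processes per second.
- Inference time: the time each model needs to detect the objects in one frame.
- Model size: the disk space each model occupies.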
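The three measurements listed above could be collected with a small harness like the one below. This is a sketch under assumptions rather than the project's evaluation code: benchmark and model_size_mb are hypothetical helpers, and a short sleep stands in for a real model call so the example runs without any weights.

```python
import os
import time
from typing import Callable, Iterable, Tuple


def benchmark(infer: Callable[[object], object], frames: Iterable[object]) -> Tuple[float, float]:
    """Run `infer` on every frame and return (fps, mean inference time in seconds)."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one frame")
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    mean_time = elapsed / len(frames)
    fps = len(frames) / elapsed if elapsed > 0 else float("inf")
    return fps, mean_time


def model_size_mb(path: str) -> float:
    """Size of a weights file on disk, in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def fake_infer(frame):
    # Stand-in for a model call such as a detection function applied to one frame.
    time.sleep(0.002)


fps, mean_time = benchmark(fake_infer, range(5))
print(f"FPS: {fps:.1f}, mean inference time: {mean_time * 1000:.1f} ms")
```

Timing the same frames with each model on the same machine keeps the comparison fair; on a CPU, the first call is often slower than the rest, so a warm-up frame before timing can be worthwhile.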
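(1) Discussion of performance differences
From the evaluation results, I observed the following:
- Speed: YOLO outperforms Faster R-CNN and SSD in both FPS and inference time, which makes it suitable for real-time applications.
- Accuracy: Faster R-CNN tends to be more accurate than YOLO and SSD in object detection tasks.
- Model size: YOLO has the smallest model size, an advantage on devices with limited storage.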
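(2) Best-performing method
Based on the evaluation results and the qualitative analysis, YOLOv8 is the best SoA method for object detection and segmentation in video sequences. Its speed, compact model size, and strong performance make it a good choice for real-world applications where both accuracy and efficiency matter.
Full project code and video: https://github.com/fatimagulomova/iu-projects/blob/main/DLBAIPCV01/MainProject.ipynb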