Face Landmark Detection with PyTorch
Can a computer really understand human faces? Have you ever wondered how Instagram applies those stunning filters to your face? The software detects keypoints on your face and projects a mask onto it. This tutorial shows you how to build similar software with PyTorch.
Dataset
In this tutorial we will use the official DLib dataset, which contains 6,666 images of varying sizes. In addition, labels_ibug_300W_train.xml (provided with the dataset) contains the coordinates of the 68 landmarks for each face. The script below downloads the dataset and extracts it in a Colab notebook.
import os

if not os.path.exists('/content/ibug_300W_large_face_landmark_dataset'):
    !wget http://dlib.net/files/data/ibug_300W_large_face_landmark_dataset.tar.gz
    !tar -xvzf 'ibug_300W_large_face_landmark_dataset.tar.gz'
    !rm -r 'ibug_300W_large_face_landmark_dataset.tar.gz'
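To get a feel for the annotation format, here is a small sketch (assuming the dataset was extracted to the path above) that parses the XML and prints the bounding box and the first few landmarks of the first image. Each image entry holds a box element with 68 part children.

import xml.etree.ElementTree as ET

tree = ET.parse('ibug_300W_large_face_landmark_dataset/labels_ibug_300W_train.xml')
root = tree.getroot()

first_image = root[2][0]                 # root[2] is the <images> element
print(first_image.attrib['file'])        # relative path of the image file
print(first_image[0].attrib)             # bounding box: top, left, width, height
for part in list(first_image[0])[:5]:    # first 5 of the 68 landmark parts
    print(part.attrib['name'], part.attrib['x'], part.attrib['y'])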
Here is a sample image from the dataset. Notice that the face occupies only a small portion of the whole image. If we fed the full image to the neural network, it would also process the background (irrelevant information), which would make learning harder for the model. Therefore, we crop the image and feed only the face region.
Sample image and landmarks from the dataset
Data Preprocessing
To keep the neural network from overfitting the training set, we randomly transform the data. The following operations are applied to the training and validation sets:
- Since the face occupies only a small portion of the whole image, crop the image and use only the face for training.
- Resize the cropped face to a (224x224) image.
- Randomly jitter the brightness, contrast, saturation, and hue of the resized face.
- After the above three transformations, randomly rotate the face.
- Convert the image and landmarks into torch tensors and normalize them to [-1, 1].
import random
from math import cos, sin, radians

import imutils
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
import torchvision.transforms.functional as TF

class Transforms():
    def __init__(self):
        pass

    def rotate(self, image, landmarks, angle):
        # rotate the image and the (normalized) landmarks by a random angle
        angle = random.uniform(-angle, +angle)
        transformation_matrix = torch.tensor([
            [+cos(radians(angle)), -sin(radians(angle))],
            [+sin(radians(angle)), +cos(radians(angle))]
        ])

        image = imutils.rotate(np.array(image), angle)

        # rotate the landmarks around the image centre (0.5, 0.5)
        landmarks = landmarks - 0.5
        new_landmarks = np.matmul(landmarks, transformation_matrix)
        new_landmarks = new_landmarks + 0.5
        return Image.fromarray(image), new_landmarks

    def resize(self, image, landmarks, img_size):
        # landmarks are already normalized to [0, 1], so only the image changes
        image = TF.resize(image, img_size)
        return image, landmarks

    def color_jitter(self, image, landmarks):
        color_jitter = transforms.ColorJitter(brightness=0.3,
                                              contrast=0.3,
                                              saturation=0.3,
                                              hue=0.1)
        image = color_jitter(image)
        return image, landmarks

    def crop_face(self, image, landmarks, crops):
        left = int(crops['left'])
        top = int(crops['top'])
        width = int(crops['width'])
        height = int(crops['height'])

        image = TF.crop(image, top, left, height, width)

        # shift the landmarks into the crop and normalize them to [0, 1]
        img_shape = np.array(image).shape
        landmarks = torch.tensor(landmarks) - torch.tensor([[left, top]])
        landmarks = landmarks / torch.tensor([img_shape[1], img_shape[0]])
        return image, landmarks

    def __call__(self, image, landmarks, crops):
        image = Image.fromarray(image)
        image, landmarks = self.crop_face(image, landmarks, crops)
        image, landmarks = self.resize(image, landmarks, (224, 224))
        image, landmarks = self.color_jitter(image, landmarks)
        image, landmarks = self.rotate(image, landmarks, angle=10)

        image = TF.to_tensor(image)
        image = TF.normalize(image, [0.5], [0.5])
        return image, landmarks
Dataset Class
Now that the transformations are ready, let's write the dataset class. labels_ibug_300W_train.xml contains the image paths, the landmark coordinates, and the bounding-box coordinates (used for cropping the face). We store these values in lists so they can be accessed easily during training. In this article, the neural network is trained on grayscale images.
import os
import cv2
import numpy as np
import xml.etree.ElementTree as ET
from torch.utils.data import Dataset

class FaceLandmarksDataset(Dataset):
    def __init__(self, transform=None):
        tree = ET.parse('ibug_300W_large_face_landmark_dataset/labels_ibug_300W_train.xml')
        root = tree.getroot()

        self.image_filenames = []
        self.landmarks = []
        self.crops = []
        self.transform = transform
        self.root_dir = 'ibug_300W_large_face_landmark_dataset'

        # root[2] is the <images> element; each child describes one annotated image
        for filename in root[2]:
            self.image_filenames.append(os.path.join(self.root_dir, filename.attrib['file']))

            # bounding box of the face, used later to crop it out of the image
            self.crops.append(filename[0].attrib)

            landmark = []
            for num in range(68):
                x_coordinate = int(filename[0][num].attrib['x'])
                y_coordinate = int(filename[0][num].attrib['y'])
                landmark.append([x_coordinate, y_coordinate])
            self.landmarks.append(landmark)

        self.landmarks = np.array(self.landmarks).astype('float32')

        assert len(self.image_filenames) == len(self.landmarks)

    def __len__(self):
        return len(self.image_filenames)

    def __getitem__(self, index):
        # read the image as grayscale
        image = cv2.imread(self.image_filenames[index], 0)
        landmarks = self.landmarks[index]

        if self.transform:
            image, landmarks = self.transform(image, landmarks, self.crops[index])

        # centre the landmarks around zero
        landmarks = landmarks - 0.5
        return image, landmarks

dataset = FaceLandmarksDataset(Transforms())
Note: landmarks = landmarks - 0.5 centres the landmarks around zero, because zero-centred outputs are easier for the neural network to learn. The output of the preprocessed dataset looks like this (the landmarks have been plotted on the images):
Preprocessed data samples
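As a quick visual sanity check (an illustrative sketch, not part of the original code), you can reproduce a figure like the one above by drawing one sample from the dataset, mapping the landmarks back to pixel coordinates, and scattering them over the image:

import matplotlib.pyplot as plt

image, landmarks = dataset[0]
landmarks = (landmarks + 0.5) * 224          # undo the centring; back to 224x224 pixel coordinates

plt.imshow(image.squeeze(0), cmap='gray')    # drop the channel dimension of the image tensor
plt.scatter(landmarks[:, 0], landmarks[:, 1], c='c', s=8)
plt.show()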
The Neural Network
We will use ResNet18 as the backbone and modify its first and last layers to suit our purpose. In the first layer, we set the number of input channels to 1 so the network accepts grayscale images. Similarly, in the last layer, the number of output features should be 68 * 2 = 136 so the model predicts the (x, y) coordinates of all 68 landmarks for each face.
import torch.nn as nn
from torchvision import models

class Network(nn.Module):
    def __init__(self, num_classes=136):
        super().__init__()
        self.model_name = 'resnet18'
        self.model = models.resnet18()
        # accept single-channel (grayscale) images instead of RGB
        self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # predict 68 * 2 = 136 values: the (x, y) coordinates of all landmarks
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)

    def forward(self, x):
        x = self.model(x)
        return x
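As a quick check (a sketch, not part of the original tutorial), a dummy batch of grayscale 224x224 images should come out as 136 values per image:

import torch

dummy = torch.randn(4, 1, 224, 224)   # a batch of 4 single-channel images
out = Network()(dummy)
print(out.shape)                       # torch.Size([4, 136])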
Training the Neural Network
We use the mean squared error between the predicted and the ground-truth landmarks as the loss function. Keep the learning rate low to avoid exploding gradients. The network weights are saved whenever the validation loss reaches a new minimum. Train for at least 20 epochs for the best performance; the snippet below sets num_epochs to 10, so increase it accordingly. The loop also relies on train_loader, valid_loader, and a print_overwrite progress helper, sketched right after this paragraph.
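A possible sketch for these pieces (the 90/10 split and the batch sizes are assumptions, not values from the original tutorial):

import sys
from torch.utils.data import DataLoader, random_split

# split the dataset into training and validation subsets (90/10 split assumed)
len_valid_set = int(0.1 * len(dataset))
len_train_set = len(dataset) - len_valid_set
train_dataset, valid_dataset = random_split(dataset, [len_train_set, len_valid_set])

# wrap the subsets in DataLoaders; shuffle the training data every epoch
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)
valid_loader = DataLoader(valid_dataset, batch_size=8, shuffle=False, num_workers=4)

def print_overwrite(step, total_step, loss, operation):
    # print progress on a single line, overwriting the previous output
    sys.stdout.write('\r')
    if operation == 'train':
        sys.stdout.write("Train Steps: %d/%d  Loss: %.4f " % (step, total_step, loss))
    else:
        sys.stdout.write("Valid Steps: %d/%d  Loss: %.4f " % (step, total_step, loss))
    sys.stdout.flush()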
import time
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

network = Network()
network.cuda()

criterion = nn.MSELoss()
optimizer = optim.Adam(network.parameters(), lr=0.0001)

loss_min = np.inf
num_epochs = 10

start_time = time.time()
for epoch in range(1, num_epochs + 1):

    loss_train = 0
    loss_valid = 0
    running_loss = 0

    network.train()
    for step, (images, landmarks) in enumerate(train_loader, 1):

        images = images.cuda()
        landmarks = landmarks.view(landmarks.size(0), -1).cuda()

        predictions = network(images)

        # clear all the gradients before calculating them
        optimizer.zero_grad()

        # find the loss for the current step
        loss_train_step = criterion(predictions, landmarks)

        # calculate the gradients
        loss_train_step.backward()

        # update the parameters
        optimizer.step()

        loss_train += loss_train_step.item()
        running_loss = loss_train / step

        print_overwrite(step, len(train_loader), running_loss, 'train')

    network.eval()
    with torch.no_grad():
        for step, (images, landmarks) in enumerate(valid_loader, 1):

            images = images.cuda()
            landmarks = landmarks.view(landmarks.size(0), -1).cuda()

            predictions = network(images)

            # find the loss for the current step
            loss_valid_step = criterion(predictions, landmarks)

            loss_valid += loss_valid_step.item()
            running_loss = loss_valid / step

            print_overwrite(step, len(valid_loader), running_loss, 'valid')

    loss_train /= len(train_loader)
    loss_valid /= len(valid_loader)

    print('\n--------------------------------------------------')
    print('Epoch: {}  Train Loss: {:.4f}  Valid Loss: {:.4f}'.format(epoch, loss_train, loss_valid))
    print('--------------------------------------------------')

    if loss_valid < loss_min:
        loss_min = loss_valid
        torch.save(network.state_dict(), '/content/face_landmarks.pth')
        print("\nMinimum Validation Loss of {:.4f} at epoch {}/{}".format(loss_min, epoch, num_epochs))
        print('Model Saved\n')

print('Training Complete')
print("Total Elapsed Time : {} s".format(time.time() - start_time))
Predicting on Unseen Data
Use the following snippet to predict landmarks in unseen images.
import time
import cv2
import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import imutils
import torch
import torch.nn as nn
from torchvision import models
import torchvision.transforms.functional as TF
#######################################################################
image_path = 'pic.jpg'
weights_path = 'face_landmarks.pth'
frontal_face_cascade_path = 'haarcascade_frontalface_default.xml'
#######################################################################
class Network(nn.Module):
    def __init__(self, num_classes=136):
        super().__init__()
        self.model_name = 'resnet18'
        self.model = models.resnet18(pretrained=False)
        self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)

    def forward(self, x):
        x = self.model(x)
        return x
#######################################################################
face_cascade = cv2.CascadeClassifier(frontal_face_cascade_path)
best_network = Network()
best_network.load_state_dict(torch.load(weights_path, map_location=torch.device('cpu')))
best_network.eval()
image = cv2.imread(image_path)
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
display_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
height, width,_ = image.shape
faces = face_cascade.detectMultiScale(grayscale_image, 1.1, 4)
all_landmarks = []
for (x, y, w, h) in faces:
    # crop the detected face and preprocess it exactly as during training
    image = grayscale_image[y:y+h, x:x+w]
    image = TF.resize(Image.fromarray(image), size=(224, 224))
    image = TF.to_tensor(image)
    image = TF.normalize(image, [0.5], [0.5])

    with torch.no_grad():
        landmarks = best_network(image.unsqueeze(0))

    # map the predicted landmarks back to the original image coordinates
    landmarks = (landmarks.view(68, 2).detach().numpy() + 0.5) * np.array([[w, h]]) + np.array([[x, y]])
    all_landmarks.append(landmarks)
plt.figure()
plt.imshow(display_image)
for landmarks in all_landmarks:
    plt.scatter(landmarks[:, 0], landmarks[:, 1], c='c', s=5)
plt.show()
An OpenCV Haar cascade classifier is used to detect faces in the image. Object detection with Haar cascades is a machine-learning approach in which a cascade function is trained on a set of input data. OpenCV already ships many pretrained classifiers for faces, eyes, pedestrians, and more. In our case, we use the frontal-face classifier; you need to download the pretrained classifier XML file and save it to your working directory.
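If you installed OpenCV through pip (the opencv-python package), the pretrained cascade files ship with the package, so you can point frontal_face_cascade_path at them instead of downloading the XML manually (a small convenience sketch, not part of the original code):

import cv2

# opencv-python bundles the pretrained Haar cascades;
# cv2.data.haarcascades is the directory that contains them
frontal_face_cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(frontal_face_cascade_path)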
Face detection
Each face detected in the input image is cropped, resized to (224, 224), and fed to our trained neural network to predict its landmarks.
Landmarks on the cropped faces
The predicted landmarks are overlaid on the cropped faces; the results are shown below. Quite impressive, isn't it?
Final results
Similarly, landmark detection on multiple faces:
Here you can see that the OpenCV Haar cascade classifier has detected multiple faces, including one false positive (a fist predicted to be a face).