[Big Data] Quickly Deploying Presto (Trino) with docker-compose
1. Overview
Presto is a fast distributed query engine originally developed at Facebook. Its core strength is federated data access: a single deployment can query Hadoop, Cassandra, relational databases, NoSQL databases, and more. Presto supports standard SQL syntax and adds extensions such as distributed queries, dynamic partitioning, and custom aggregate and analytic functions.
Presto has since split into two branches: PrestoDB (backed by Facebook/Meta) and PrestoSQL, which was created by Presto's founding team and has been renamed Trino (now stewarded by the Trino Software Foundation). Although PrestoDB has Facebook behind it, its community activity and user base still lag well behind Trino's, so this article focuses on Trino.
2. Prerequisites
1) Install Docker
# Install yum-utils, which provides the yum-config-manager tool
yum -y install yum-utils
# The Aliyun mirror is recommended (the official repo is shown commented out)
#yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
# Install docker-ce
yum install -y docker-ce
# Start Docker now and enable it at boot
systemctl enable --now docker
docker --version
2) Install docker-compose
curl -SL https://github.com/docker/compose/releases/download/v2.16.0/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
docker-compose --version
3. Create the Network
# Create the network. Note: do not name it hadoop_network, or the hs2 (HiveServer2) service will fail to start!
docker network create hadoop-network
# List networks
docker network ls
4. Trino Orchestration and Deployment
1) Download Trino
Official download page: https://trino.io/download.html
# Trino server
wget https://repo1.maven.org/maven2/io/trino/trino-server/416/trino-server-416.tar.gz
# Trino command-line client
wget https://repo1.maven.org/maven2/io/trino/trino-cli/416/trino-cli-416-executable.jar
# JDK
wget https://cdn.azul.com/zulu/bin/zulu20.30.11-ca-jdk20.0.1-linux_x64.tar.gz
2) Configuration
First create the etc (and images) directories that the configuration files and image build below rely on (the node data directory itself lives inside the container):
mkdir -p etc/{coordinator,worker} etc/catalog/ images
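After the mkdir above, a quick listing should show the layout the later steps assume. A minimal check, run from the project root:

```shell
# Recreate the layout and list it; 'images' holds build artifacts,
# 'etc/catalog' the data-source definitions added in section 5.
mkdir -p etc/coordinator etc/worker etc/catalog images
find etc images -type d | sort
# expected (in a fresh project directory):
# etc
# etc/catalog
# etc/coordinator
# etc/worker
# images
```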
1. Coordinator configuration
- node.properties
cat << EOF > etc/coordinator/node.properties
# Environment name; every Trino node in the cluster must use the same value.
node.environment=test
# Unique identifier for this Trino installation; must be unique on every node.
node.id=trino-coordinator
# Data directory (filesystem path); Trino stores logs and other data here.
node.data-dir=/opt/apache/trino/data
EOF
- jvm.config
cat << EOF > etc/coordinator/jvm.config
-server
-Xmx2G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics
# Disable Preventive GC for performance reasons (JDK-8293861)
-XX:-G1UsePreventiveGC
EOF
- config.properties
cat << EOF > etc/coordinator/config.properties
# Make this node the coordinator
coordinator=true
# Whether to also schedule work on the coordinator (i.e., let it double as a worker); disabled here
node-scheduler.include-coordinator=false
# HTTP server port; Trino uses HTTP for all internal and external communication
http-server.http.port=8080
# Maximum distributed memory a query may use. [Note] Must not exceed the JVM max heap size (-Xmx)
query.max-memory=1GB
# Maximum user memory a query may use on any single node. [Note] Must also not exceed the JVM max heap size
query.max-memory-per-node=1GB
# Discovery service URI; a hostname or an IP both work
discovery.uri=http://localhost:8080
EOF
- log.properties
cat << EOF > etc/coordinator/log.properties
# Log level; one of DEBUG, INFO, WARN, ERROR
io.trino=INFO
EOF
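The [Note] on the memory settings can be checked mechanically. Below is a minimal sketch that mirrors the two relevant lines and verifies query.max-memory-per-node fits inside -Xmx; point the paths at the real files under etc/coordinator/ to check them (the whole-GB parsing is an assumption of this sketch):

```shell
# Mirror the two relevant lines from the files above (sketch inputs)
printf -- '-Xmx2G\n' > jvm.config.sample
printf 'query.max-memory-per-node=1GB\n' > config.properties.sample

# Parse the whole-GB values
xmx_gb=$(grep -o '^-Xmx[0-9]*G' jvm.config.sample | tr -dc '0-9')
qmax_gb=$(sed -n 's/^query.max-memory-per-node=\([0-9]*\)GB$/\1/p' config.properties.sample)

if [ "$qmax_gb" -le "$xmx_gb" ]; then
    echo "OK: per-node query memory ${qmax_gb}GB fits in heap ${xmx_gb}GB"
else
    echo "ERROR: raise -Xmx or lower query.max-memory-per-node"
fi
```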
2. Worker configuration
- node.properties
cat << EOF > etc/worker/node.properties
# Environment name; every Trino node in the cluster must use the same value.
node.environment=test
# Unique identifier for this installation; left commented out so each replicated worker generates its own.
# node.id=trino-worker
# Data directory (filesystem path); Trino stores logs and other data here.
node.data-dir=/opt/apache/trino/data
EOF
- jvm.config
cat << EOF > etc/worker/jvm.config
-server
-Xmx2G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics
# Disable Preventive GC for performance reasons (JDK-8293861)
-XX:-G1UsePreventiveGC
EOF
- config.properties
cat << EOF > etc/worker/config.properties
# Make this node a worker
coordinator=false
# HTTP server port; Trino uses HTTP for all internal and external communication
http-server.http.port=8080
# Maximum distributed memory a query may use. [Note] Must not exceed the JVM max heap size (-Xmx)
query.max-memory=1GB
# Maximum user memory a query may use on any single node. [Note] Must also not exceed the JVM max heap size
query.max-memory-per-node=1GB
# Discovery URI pointing at the coordinator; a hostname or an IP both work
discovery.uri=http://trino-coordinator:8080
EOF
- log.properties
cat << EOF > etc/worker/log.properties
# Log level; one of DEBUG, INFO, WARN, ERROR
io.trino=INFO
EOF
3) Startup script: bootstrap.sh
#!/usr/bin/env sh

wait_for() {
    echo "Waiting for $1 to listen on $2..."
    while ! nc -z "$1" "$2"; do echo waiting...; sleep 1s; done
}

start_trino() {
    node_type=$1
    # Workers must wait until the coordinator's discovery service is reachable
    if [ "$node_type" = "worker" ]; then
        wait_for trino-coordinator 8080
    fi
    ${TRINO_HOME}/bin/launcher run --verbose
}

case $1 in
    trino-coordinator)
        start_trino coordinator
        ;;
    trino-worker)
        start_trino worker
        ;;
    *)
        echo "Usage: $0 {trino-coordinator|trino-worker}"
        ;;
esac
4) Build the image: Dockerfile
FROM registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/centos:7.7.1908
RUN rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
# Use ENV (not RUN export) so the locale persists in the image
ENV LANG zh_CN.UTF-8
# Create the hadoop user and group (uid/gid 10000) referenced by user: in the compose file
RUN groupadd --system --gid=10000 hadoop && useradd --system --home-dir /home/hadoop --uid=10000 --gid=hadoop hadoop -m
# Install sudo
RUN yum -y install sudo ; chmod 640 /etc/sudoers
# Grant the hadoop user passwordless sudo
RUN echo "hadoop ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
RUN yum -y install net-tools telnet wget nc
RUN mkdir /opt/apache/
# Add and configure the JDK
ADD zulu20.30.11-ca-jdk20.0.1-linux_x64.tar.gz /opt/apache/
ENV JAVA_HOME /opt/apache/zulu20.30.11-ca-jdk20.0.1-linux_x64
ENV PATH $JAVA_HOME/bin:$PATH
# Add and configure the Trino server
ENV TRINO_VERSION 416
ADD trino-server-${TRINO_VERSION}.tar.gz /opt/apache/
ENV TRINO_HOME /opt/apache/trino
RUN ln -s /opt/apache/trino-server-${TRINO_VERSION} $TRINO_HOME
# Create the configuration directory and the catalog (data source) directory
RUN mkdir -p ${TRINO_HOME}/etc/catalog
# Add the Trino CLI
COPY trino-cli-416-executable.jar $TRINO_HOME/bin/trino-cli
# Copy bootstrap.sh
COPY bootstrap.sh /opt/apache/
RUN chmod +x /opt/apache/bootstrap.sh ${TRINO_HOME}/bin/trino-cli
RUN chown -R hadoop:hadoop /opt/apache
WORKDIR $TRINO_HOME
Build the image:
docker build -t registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino:416 . --no-cache
# To save readers a local build, I also push the image to my Aliyun registry (optional):
docker push registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino:416
### Options explained
# -t: image name (and tag)
# . : build context is the current directory (the Dockerfile lives here)
# -f: path to the Dockerfile (when it is not ./Dockerfile)
# --no-cache: build without using the layer cache
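If the build succeeded, a quick smoke test confirms the JDK and the Trino binaries landed where the Dockerfile put them. A sketch; it silently skips when Docker or the image is absent:

```shell
IMG=registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino:416
if docker image inspect "$IMG" >/dev/null 2>&1; then
    # WORKDIR is $TRINO_HOME, so relative paths resolve inside the image
    docker run --rm "$IMG" java -version
    docker run --rm "$IMG" ls bin/launcher bin/trino-cli etc/catalog
fi
```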
5) Orchestration: docker-compose.yaml
version: '3'
services:
trino-coordinator:
image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino:416
user: "hadoop:hadoop"
container_name: trino-coordinator
hostname: trino-coordinator
restart: always
privileged: true
env_file:
- .env
volumes:
- ./etc/coordinator/config.properties:${TRINO_HOME}/etc/config.properties
- ./etc/coordinator/jvm.config:${TRINO_HOME}/etc/jvm.config
- ./etc/coordinator/log.properties:${TRINO_HOME}/etc/log.properties
- ./etc/coordinator/node.properties:${TRINO_HOME}/etc/node.properties
- ./etc/catalog/:${TRINO_HOME}/etc/catalog/
ports:
- "30080:${TRINO_SERVER_PORT}"
command: ["sh","-c","/opt/apache/bootstrap.sh trino-coordinator"]
networks:
- hadoop-network
healthcheck:
test: ["CMD-SHELL", "curl --fail http://localhost:${TRINO_SERVER_PORT}/v1/info || exit 1"]
interval: 10s
timeout: 20s
retries: 3
trino-worker:
image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino:416
user: "hadoop:hadoop"
restart: always
privileged: true
deploy:
replicas: 3
env_file:
- .env
volumes:
- ./etc/worker/config.properties:${TRINO_HOME}/etc/config.properties
- ./etc/worker/jvm.config:${TRINO_HOME}/etc/jvm.config
- ./etc/worker/log.properties:${TRINO_HOME}/etc/log.properties
- ./etc/worker/node.properties:${TRINO_HOME}/etc/node.properties
- ./etc/catalog/:${TRINO_HOME}/etc/catalog/
expose:
- "${TRINO_SERVER_PORT}"
command: ["sh","-c","/opt/apache/bootstrap.sh trino-worker"]
networks:
- hadoop-network
healthcheck:
test: ["CMD-SHELL", "curl --fail http://localhost:${TRINO_SERVER_PORT}/v1/info || exit 1"]
interval: 10s
timeout: 10s
retries: 3
# Attach to the externally created network
networks:
hadoop-network:
external: true
The .env file contents are as follows (TRINO_HOME must match the ENV set in the Dockerfile, since the volume mounts above reference it):
cat << EOF > .env
TRINO_SERVER_PORT=8080
TRINO_HOME=/opt/apache/trino
EOF
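Compose substitutes ${TRINO_SERVER_PORT} and ${TRINO_HOME} into the volume paths and port mappings, so it is worth confirming they expand as expected. A minimal sketch that writes a throwaway copy rather than touching the real file:

```shell
# Write a throwaway copy of the .env contents
cat << EOF > .env.sample
TRINO_SERVER_PORT=8080
TRINO_HOME=/opt/apache/trino
EOF

# Export every assignment, as docker-compose does, then inspect the values
set -a; . ./.env.sample; set +a
echo "port=${TRINO_SERVER_PORT} home=${TRINO_HOME}"
# prints: port=8080 home=/opt/apache/trino
```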
6) Deploy
docker-compose -f docker-compose.yaml up -d
# Check status
docker-compose -f docker-compose.yaml ps
Web UI: http://ip:30080
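Rather than refreshing the Web UI, readiness can be polled from a script. The sketch below assumes port 30080 is published on the local host; it uses the real REST endpoint /v1/info, whose JSON includes a "starting" flag. The commented call at the end is where you would run it:

```shell
# True once the coordinator's /v1/info JSON reports "starting":false
is_ready() {
    echo "$1" | grep -q '"starting":false'
}

# Poll the given base URL until ready or until the attempts run out
poll() {
    for attempt in 1 2 3 4 5 6 7 8 9 10; do
        body=$(curl -sf "$1/v1/info") && is_ready "$body" && { echo ready; return 0; }
        sleep 2
    done
    echo timeout; return 1
}

# poll http://localhost:30080
```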
5. Quick Test and Verification
For quickly deploying Hive and MySQL, see my article: Quickly Deploy Hive with docker-compose (Detailed Tutorial)
1) MySQL data source
Add the MySQL catalog. Editing the file on the host is enough, because the catalog directory is mounted into every container:
cat << EOF > ./etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://mysql:3306
connection-user=root
connection-password=123456
EOF
Restart Trino
docker-compose -f docker-compose.yaml restart
Test and verify:
# Enter the container
docker exec -it trino-coordinator bash
${TRINO_HOME}/bin/trino-cli --server http://trino-coordinator:8080 --user=hadoop
# List catalogs
show catalogs;
# List MySQL schemas
show schemas from mysql;
# List tables
show tables from mysql.hive_metastore;
# Query table data
select * from mysql.hive_metastore.version;
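The same checks can be scripted without an interactive shell via trino-cli --execute. A sketch; it only issues queries when the trino-coordinator container from section 4 is actually running:

```shell
# Wrap the CLI invocation used above in a helper
trino_exec() {
    docker exec trino-coordinator /opt/apache/trino/bin/trino-cli \
        --server http://trino-coordinator:8080 --user hadoop --execute "$1"
}

# Only query when the coordinator container is up
if docker ps --format '{{.Names}}' 2>/dev/null | grep -q '^trino-coordinator$'; then
    trino_exec 'SHOW CATALOGS;'
    trino_exec 'SELECT * FROM mysql.hive_metastore.version;'
fi
```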
2) Hive data source
Add the Hive catalog. Again, editing the file on the host is enough, because the catalog directory is mounted:
cat << EOF > etc/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
EOF
Restart Trino
docker-compose -f docker-compose.yaml restart
Test and verify:
# Enter the container
docker exec -it trino-coordinator bash
${TRINO_HOME}/bin/trino-cli --server http://trino-coordinator:8080 --user=hadoop
# List catalogs
show catalogs;
# List Hive schemas
show schemas from hive;
# List tables
show tables from hive.default;
# Query table data
select * from hive.default.student;
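With both catalogs registered, a single statement can join across them, which is Trino's headline feature. A sketch: the join below is illustrative, assuming the hive.default.student table above and the standard metastore version table (VER_ID, SCHEMA_VERSION, VERSION_COMMENT), and it only runs when the stack is up:

```shell
# Cross-catalog query: Hive table data alongside a MySQL metastore row
QUERY='SELECT s.*, v.SCHEMA_VERSION
FROM hive.default.student s
CROSS JOIN mysql.hive_metastore.version v
LIMIT 10'

if docker ps --format '{{.Names}}' 2>/dev/null | grep -q '^trino-coordinator$'; then
    docker exec trino-coordinator /opt/apache/trino/bin/trino-cli \
        --server http://trino-coordinator:8080 --user hadoop --execute "$QUERY"
fi
```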