Several Ways to Compute PV and UV in Real Time with Flink
This article is reposted from the WeChat public account "Java大數(shù)據(jù)與數(shù)據(jù)倉(cāng)庫(kù)", written by 柯少爺. To repost it, please contact the Java大數(shù)據(jù)與數(shù)據(jù)倉(cāng)庫(kù) account.
Real-time PV and UV statistics are about the most common big-data counting requirement there is. An earlier post covered a Spark Streaming case for real-time PV/UV; here we compute PV and UV in real time with Flink.
We need to compute daily PV and UV for each data type, with the following requirements:
- emit the latest statistics every second;
- the job runs indefinitely, so stale data must be cleaned out of memory periodically;
- the time field in incoming messages is not strictly increasing, so the job needs some tolerance for out-of-order events;
- UV does not necessarily change every second, and re-emitting an unchanged value is a huge waste of IO, so a result should be emitted within one second of a UV change and suppressed otherwise.
Types and operations on Flink data streams
DataStream is the core data structure of Flink stream processing; every other stream type can be converted to and from it, directly or indirectly. The common direct conversions between the stream types are as follows:
DataStream and KeyedStream can be converted into each other; a KeyedStream can be turned into a WindowedStream; a DataStream cannot be turned into a WindowedStream directly; and a WindowedStream can be turned directly back into a DataStream. Where two stream types have no direct conversion, you can always go back to a DataStream first and convert from there.
For this PV/UV job we mainly use the DataStream, KeyedStream, and WindowedStream structures.
We also need windows and watermarks: the window splits the data by day, and the watermark ("water level") lets us periodically discard late data that falls outside the window, which keeps memory under control.
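To make these conversions concrete, here is a minimal, self-contained sketch (the element type and the aggregation are made up for illustration; it is not the article's job) that walks a stream from DataStream to KeyedStream to WindowedStream and back to a DataStream:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamConversionSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // DataStream
    val events: DataStream[(String, Int)] =
      env.fromElements(("a", 1), ("b", 1), ("a", 1))

    // DataStream -> KeyedStream (keyBy)
    val keyed: KeyedStream[(String, Int), String] = events.keyBy(_._1)

    // KeyedStream -> WindowedStream (window assignment)
    val windowed = keyed.timeWindow(Time.days(1))

    // WindowedStream -> DataStream (window aggregation)
    // Note: with a finite demo source the daily window may never fire before the
    // job ends; the point here is only the chain of stream types.
    val back: DataStream[(String, Int)] =
      windowed.reduce((a, b) => (a._1, a._2 + b._2))

    back.print()
    env.execute("stream-conversion-sketch")
  }
}
```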
Business code
Our data is JSON containing three fields: date, version, and guid. For real-time PV/UV the other fields can simply be dropped; in the offline data warehouse, of course, every meaningful business field is kept in Hive. Other related concepts will get a dedicated post; below is the code.
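First, to make the JSON parsing in the job easier to follow, here is a hypothetical record with the nesting the parsing code below expects (the field names come from the code; the concrete values are invented): the outer object carries a message string, which is itself JSON holding a human-readable time plus a decrypted_data string, which in turn is JSON containing an epoch-millisecond time, the guid, and the helperversion.

```scala
import com.alibaba.fastjson.JSONObject

// A hypothetical input record, assuming the nesting the parsing code below expects.
object SampleRecord {
  def main(args: Array[String]): Unit = {
    val decrypted = new JSONObject()
    decrypted.put("time", "1552901640000")      // event time in ms -> drives the watermark
    decrypted.put("guid", "u-001")              // user id -> uv
    decrypted.put("helperversion", "1.0.0")     // version dimension

    val message = new JSONObject()
    message.put("time", "2019-03-18 17:34:00")  // "yyyy-MM-dd HH:mm:ss" -> date dimension
    message.put("decrypted_data", decrypted.toJSONString)

    val record = new JSONObject()
    record.put("message", message.toJSONString)

    // this is the kind of string the Flink job reads from Kafka
    println(record.toJSONString)
  }
}
```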
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ddxygq</groupId>
    <artifactId>bigdata</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.version>2.11.8</scala.version>
        <flink.version>1.7.0</flink.version>
        <pkg.name>bigdata</pkg.name>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.8 -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>

    <build>
        <!-- test code and files -->
        <!--<testSourceDirectory>${basedir}/src/test</testSourceDirectory>-->
        <finalName>${pkg.name}</finalName>
        <sourceDirectory>src/main/java</sourceDirectory>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>*.properties</include>
                    <include>*.xml</include>
                </includes>
                <filtering>false</filtering>
            </resource>
        </resources>
        <plugins>
            <!-- skip tests -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
            <!-- compile scala -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```
The main code, written mostly in Scala:
```scala
package com.ddxygq.bigdata.flink.streaming.pvuv

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.extensions._
import org.apache.flink.api.scala._

/**
  * @ Author: keguang
  * @ Date: 2019/3/18 17:34
  * @ version: v1.0.0
  * @ description:
  */
object PvUvCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // fault tolerance
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/flink/tagApp"))
    // use event time so the watermark assigned below actually drives the daily windows
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // kafka configuration
    val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
    val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val TRANSACTION_GROUP = "flink-count"
    val TOPIC_NAME = "flink"
    val kafkaProps = new Properties()
    kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
    kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
    kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

    // maximum event lateness tolerated by the watermark
    val MaxOutOfOrderness = 86400 * 1000L

    // consume the kafka data
    val streamData: DataStream[(String, String, String)] = env.addSource(
      new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
    ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
      override def extractTimestamp(element: String): Long = {
        val t = JSON.parseObject(element)
        val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
        time.toLong
      }
    }).map(x => {
      var date = "error"
      var guid = "error"
      var helperversion = "error"
      try {
        val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
        val datetime = messageJsonObject.getString("time")
        date = datetime.split(" ")(0)
        // hour = datetime.split(" ")(1).substring(0, 2)
        val decrypted_data_string = messageJsonObject.getString("decrypted_data")
        if (!"".equals(decrypted_data_string)) {
          val decrypted_data = JSON.parseObject(decrypted_data_string)
          guid = decrypted_data.getString("guid").trim
          helperversion = decrypted_data.getString("helperversion")
        }
      } catch {
        case e: Exception => {
          println(e)
        }
      }
      (date, helperversion, guid)
    })
    // everything above assigns the watermark and parses the json

    // aggregate the data in the window; the applyWith method and the OnWindowedStream class are worth a look
    val resultStream = streamData.keyBy(x => {
      x._1 + x._2
    }).timeWindow(Time.days(1))
      .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
      .applyWith(("", List.empty[Int], Set.empty[Int], 0L, 0L))(
        foldFunction = {
          case ((_, list, set, _, 0), item) => {
            val date = item._1
            val helperversion = item._2
            val guid = item._3
            (date + "_" + helperversion, guid.hashCode +: list, set + guid.hashCode, 0L, 0L)
          }
        }
        , windowFunction = {
          case (key, window, result) => {
            result.map {
              case (leixing, list, set, _, _) => {
                (leixing, list.size, set.size, window.getStart, window.getEnd)
              }
            }
          }
        }
      ).keyBy(0)
      .flatMapWithState[(String, Int, Int, Long, Long), (Int, Int)] {
        case ((key, numpv, numuv, begin, end), curr) =>
          curr match {
            case Some(numCurr) if numCurr == (numuv, numpv) =>
              (Seq.empty, Some((numuv, numpv))) // same value as last time: emit nothing
            case _ =>
              (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
          }
      }

    // final result
    val resultedStream = resultStream.map(x => {
      val keys = x._1.split("_")
      val date = keys(0)
      val helperversion = keys(1)
      (date, helperversion, x._2, x._3)
    })
    resultedStream.print()
    env.execute("PvUvCount")
  }
}
```
The size of a List holds the pv and the size of a Set holds the uv, which gives us real-time pv and uv.
A few key functions are used here:
applyWith: its parameters are the initial state value, a foldFunction, and a windowFunction.
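As a minimal, Flink-free illustration of why these two collections give pv and uv (the events below are made up), folding a handful of records into a (List, Set) accumulator behaves just like the foldFunction above:

```scala
// Toy illustration of the accumulator used in the foldFunction above:
// every event grows the List (pv), while the Set only keeps distinct guids (uv).
object PvUvAccumulatorSketch {
  def main(args: Array[String]): Unit = {
    // hypothetical (date_version, guid) events
    val events = Seq(
      ("2019-03-18_1.0.0", "u-001"),
      ("2019-03-18_1.0.0", "u-002"),
      ("2019-03-18_1.0.0", "u-001") // repeat visit: pv grows, uv does not
    )

    val (pvList, uvSet) = events.foldLeft((List.empty[Int], Set.empty[Int])) {
      case ((list, set), (_, guid)) =>
        (guid.hashCode +: list, set + guid.hashCode)
    }

    println(s"pv = ${pvList.size}, uv = ${uvSet.size}") // pv = 3, uv = 2
  }
}
```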
Problems with this approach
Obviously, once the data volume grows, this List and this Set become very large. The pv in particular does not need a List at all: a single state variable that is incremented event by event, i.e. a plain state update, is enough.
Improved version
This version stores the pv value in a counter.
```scala
package com.ddxygq.bigdata.flink.streaming.pvuv

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.extensions._
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem

object PvUv2 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // fault tolerance
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/streaming/counter"))
    // use event time so the watermark assigned below actually drives the daily windows
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // kafka configuration
    val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
    val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val TRANSACTION_GROUP = "flink-count"
    val TOPIC_NAME = "flink"
    val kafkaProps = new Properties()
    kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
    kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
    kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

    // maximum event lateness tolerated by the watermark
    val MaxOutOfOrderness = 86400 * 1000L

    val streamData: DataStream[(String, String, String)] = env.addSource(
      new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
    ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
      override def extractTimestamp(element: String): Long = {
        val t = JSON.parseObject(element)
        val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
        time.toLong
      }
    }).map(x => {
      var date = "error"
      var guid = "error"
      var helperversion = "error"
      try {
        val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
        val datetime = messageJsonObject.getString("time")
        date = datetime.split(" ")(0)
        // hour = datetime.split(" ")(1).substring(0, 2)
        val decrypted_data_string = messageJsonObject.getString("decrypted_data")
        if (!"".equals(decrypted_data_string)) {
          val decrypted_data = JSON.parseObject(decrypted_data_string)
          guid = decrypted_data.getString("guid").trim
          helperversion = decrypted_data.getString("helperversion")
        }
      } catch {
        case e: Exception => {
          println(e)
        }
      }
      (date, helperversion, guid)
    })

    val resultStream = streamData.keyBy(x => {
      x._1 + x._2
    }).timeWindow(Time.days(1))
      .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
      .applyWith(("", new IntCounter(), Set.empty[Int], 0L, 0L))(
        foldFunction = {
          case ((_, cou, set, _, 0), item) => {
            val date = item._1
            val helperversion = item._2
            val guid = item._3
            cou.add(1)
            (date + "_" + helperversion, cou, set + guid.hashCode, 0L, 0L)
          }
        }
        , windowFunction = {
          case (key, window, result) => {
            result.map {
              case (leixing, cou, set, _, _) => {
                (leixing, cou.getLocalValue, set.size, window.getStart, window.getEnd)
              }
            }
          }
        }
      ).keyBy(0)
      .flatMapWithState[(String, Int, Int, Long, Long), (Int, Int)] {
        case ((key, numpv, numuv, begin, end), curr) =>
          curr match {
            case Some(numCurr) if numCurr == (numuv, numpv) =>
              (Seq.empty, Some((numuv, numpv))) // same value as last time: emit nothing
            case _ =>
              (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
          }
      }

    // final result
    val resultedStream = resultStream.map(x => {
      val keys = x._1.split("_")
      val date = keys(0)
      val helperversion = keys(1)
      (date, helperversion, x._2, x._3)
    })

    val resultPath = "D:\\space\\IJ\\bigdata\\src\\main\\scala\\com\\ddxygq\\bigdata\\flink\\streaming\\pvuv\\result"
    resultedStream.writeAsText(resultPath, FileSystem.WriteMode.OVERWRITE)
    env.execute("PvUvCount")
  }
}
```
Further improvement
This version still keeps a Set for the uv, which inevitably puts pressure on memory. If the cluster is small and we want to keep costs down, we can push the deduplication to an external store: the uniqueness of an HBase rowkey, or a Redis set, both give fast, real-time deduplication, as sketched below.
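Here is a rough sketch of the Redis variant only (the host "redis-host" and the key layout "uv:<date>_<version>" are assumptions; connection pooling, error handling, and key expiry are omitted). Each guid is SADDed into a per-day-and-version set, and because SADD is idempotent, repeated guids never grow the set:

```scala
import redis.clients.jedis.Jedis

// Sketch only: deduplicate guids in an external Redis set instead of Flink state.
object RedisUvDedupSketch {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("redis-host", 6379)

    // SADD returns 1 when the guid is new, 0 when it was already present,
    // so the set only ever holds distinct guids.
    def recordAndCount(dateVersion: String, guid: String): Long = {
      val key = s"uv:$dateVersion"
      jedis.sadd(key, guid)
      jedis.scard(key) // current number of distinct guids = uv
    }

    println(recordAndCount("2019-03-18_1.0.0", "u-001")) // 1
    println(recordAndCount("2019-03-18_1.0.0", "u-001")) // still 1
    println(recordAndCount("2019-03-18_1.0.0", "u-002")) // 2

    jedis.close()
  }
}
```

Inside the Flink job this would typically live in a RichMapFunction, with the Jedis connection created in open(); pv can then stay as the simple counter from the improved version while uv is read back from Redis.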
References
https://flink.sojb.cn/dev/event_time.html
http://wuchong.me/blog/2016/05/20/flink-internals-streams-and-operations-on-streams
https://segmentfault.com/a/1190000006235690