
Several Ways to Compute PV and UV in Real Time with Flink


This article is reposted from the WeChat official account 「Java大數據與數據倉庫」, written by 柯少爺. To repost it, please contact the Java大數據與數據倉庫 official account.

Real-time PV and UV counting is one of the most common requirements in big data. An earlier post covered a Spark Streaming example of real-time PV and UV counting; this one does the same with Flink.

We need to count the daily PV and UV for each data type, subject to the following requirements:

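  • The latest statistics must be output every second;
  • The program runs indefinitely, so stale data in memory must be cleaned up periodically;
  • The time field in incoming messages is not strictly increasing, so some tolerance for out-of-order data is required;
  • The UV does not necessarily change every second, and repeated output is a huge waste of IO, so a result should be output within one second when the UV changes and not at all when it has not changed.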
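
The last requirement amounts to remembering, per key, the counts that were emitted most recently and emitting again only when they differ. A minimal sketch of that idea in plain Scala, with no Flink dependency; the object, type, and method names are illustrative and not part of any library:

```scala
object EmitOnChangeSketch {
  type Counts = (Int, Int) // (pv, uv)

  // Given the new counts and the last emitted counts for a key,
  // return (records to emit, state to remember for the next call).
  def step(key: String, now: Counts, last: Option[Counts]): (Seq[(String, Int, Int)], Option[Counts]) =
    last match {
      // Unchanged since the last emission: emit nothing, keep the state.
      case Some(prev) if prev == now => (Seq.empty, last)
      // First record for this key, or the counts changed: emit and remember.
      case _ => (Seq((key, now._1, now._2)), Some(now))
    }
}
```

The function returns both the output and the next state, so it can be dropped into any per-key stateful operator that follows that contract.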

Types and Operations on Flink Data Streams

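DataStream is the core data structure of Flink stream processing. The other stream types can all be converted to one another, directly or indirectly, through DataStream. The conversions between some commonly used streams are as follows.

DataStream and KeyedStream can be converted into each other. A KeyedStream can be converted into a WindowedStream, a DataStream cannot be converted directly into a WindowedStream, and a WindowedStream can be converted directly into a DataStream. Although not every pair of stream types can be converted directly, any conversion can be achieved by first converting to a DataStream and then to the target stream type.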

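This PV and UV requirement mainly uses the DataStream, KeyedStream, and WindowedStream data structures.

Both windows and watermarks are needed here. The window splits the data by day, and the watermark (the "water level") tells Flink how far event time has progressed, so that late records falling outside a window can be dropped and the window's state periodically cleaned out of memory.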
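
To make the window and watermark arithmetic concrete, here is a minimal plain-Scala sketch. It is not Flink's implementation. It assumes non-negative epoch-millisecond timestamps, UTC-aligned tumbling windows with no offset, and a watermark that simply trails the largest timestamp seen so far by a fixed bound. The object and function names are illustrative.

```scala
object WatermarkSketch {
  val DayMs: Long = 24L * 60 * 60 * 1000

  // Start of the tumbling window of length `size` that contains `ts`.
  // Assumes ts >= 0 and no window offset.
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Bounded out-of-orderness: the watermark lags the largest timestamp
  // seen so far by `maxOutOfOrderness` milliseconds.
  def watermark(maxSeenTs: Long, maxOutOfOrderness: Long): Long =
    maxSeenTs - maxOutOfOrderness

  // A window [start, start + size) is complete once the watermark has
  // reached its last millisecond; after that its state can be dropped.
  def windowComplete(start: Long, size: Long, wm: Long): Boolean =
    wm >= start + size - 1
}
```

With a one-day bound, as used in the code in this article, a daily window only counts as complete roughly one day after its end, which is the price paid for tolerating records that arrive up to a day late.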

Business Code

Our data is JSON containing three fields: date, version, and guid. For real-time PV and UV counting the other fields can simply be dropped, although in an offline data warehouse every meaningful business field would of course be kept in Hive. Related concepts will be covered separately, so let's go straight to the code.

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>com.ddxygq</groupId>
        <artifactId>bigdata</artifactId>
        <version>1.0-SNAPSHOT</version>

        <properties>
            <scala.version>2.11.8</scala.version>
            <flink.version>1.7.0</flink.version>
            <pkg.name>bigdata</pkg.name>
        </properties>

        <dependencies>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-scala_2.11</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-scala_2.11</artifactId>
                <version>${flink.version}</version>
            </dependency>

            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-java_2.11</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.8 -->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <!-- fastjson is imported by the Scala code below but was not listed
                 in the original pom; the version here is an assumption -->
            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>1.2.83</version>
            </dependency>
        </dependencies>

        <build>
            <!-- Test code and files -->
            <!--<testSourceDirectory>${basedir}/src/test</testSourceDirectory>-->
            <finalName>${pkg.name}</finalName>
            <sourceDirectory>src/main/java</sourceDirectory>
            <resources>
                <resource>
                    <directory>src/main/resources</directory>
                    <includes>
                        <include>*.properties</include>
                        <include>*.xml</include>
                    </includes>
                    <filtering>false</filtering>
                </resource>
            </resources>
            <plugins>
                <!-- Plugin configuration to skip tests -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <configuration>
                        <skip>true</skip>
                    </configuration>
                </plugin>
                <!-- Scala compiler plugin -->
                <plugin>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <version>2.15.2</version>
                    <executions>
                        <execution>
                            <goals>
                                <goal>compile</goal>
                                <goal>testCompile</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>

The main code, written mostly in Scala:

    package com.ddxygq.bigdata.flink.streaming.pvuv

    import java.util.Properties

    import com.alibaba.fastjson.JSON
    import org.apache.flink.runtime.state.filesystem.FsStateBackend
    import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala.extensions._
    import org.apache.flink.api.scala._

    /**
      * @ Author: keguang
      * @ Date: 2019/3/18 17:34
      * @ version: v1.0.0
      * @ description:
      */
    object PvUvCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Added to the original listing: Flink 1.7 defaults to processing time,
        // so event time must be enabled for the watermarks below to take effect
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        // Fault tolerance
        env.enableCheckpointing(5000)
        env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
        env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/flink/tagApp"))

        // Kafka configuration
        val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
        val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
        val TRANSACTION_GROUP = "flink-count"
        val TOPIC_NAME = "flink"
        val kafkaProps = new Properties()
        kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
        kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
        kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

        // Watermark: maximum allowed out-of-orderness (one day, in milliseconds)
        val MaxOutOfOrderness = 86400 * 1000L

        // Consume data from Kafka
        val streamData: DataStream[(String, String, String)] = env.addSource(
          new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
        ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
          override def extractTimestamp(element: String): Long = {
            val t = JSON.parseObject(element)
            val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
            time.toLong
          }
        }).map(x => {
          // Fields fall back to "error" when the record cannot be parsed
          var date = "error"
          var guid = "error"
          var helperversion = "error"
          try {
            val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
            val datetime = messageJsonObject.getString("time")
            date = datetime.split(" ")(0)
            // hour = datetime.split(" ")(1).substring(0, 2)
            val decrypted_data_string = messageJsonObject.getString("decrypted_data")
            if (!"".equals(decrypted_data_string)) {
              val decrypted_data = JSON.parseObject(decrypted_data_string)
              guid = decrypted_data.getString("guid").trim
              helperversion = decrypted_data.getString("helperversion")
            }
          } catch {
            case e: Exception => {
              println(e)
            }
          }
          (date, helperversion, guid)
        })
        // The code above assigns watermarks and parses the JSON.
        // Aggregate the data in the window; see the applyWith method and the OnWindowedStream class
        val resultStream = streamData.keyBy(x => {
          x._1 + x._2
        }).timeWindow(Time.days(1))
          .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
          .applyWith(("", List.empty[Int], Set.empty[Int], 0L, 0L))(
            foldFunction = {
              case ((_, list, set, _, _), item) => {
                val date = item._1
                val helperversion = item._2
                val guid = item._3
                (date + "_" + helperversion, guid.hashCode +: list, set + guid.hashCode, 0L, 0L)
              }
            }
            , windowFunction = {
              case (key, window, result) => {
                result.map {
                  case (leixing, list, set, _, _) => {
                    // list.size is the pv, set.size is the uv
                    (leixing, list.size, set.size, window.getStart, window.getEnd)
                  }
                }
              }
            }
          ).keyBy(0)
          .flatMapWithState[(String, Int, Int, Long, Long), (Int, Int)] {
            case ((key, numpv, numuv, begin, end), curr) =>

              curr match {
                case Some(numCurr) if numCurr == ((numuv, numpv)) =>
                  (Seq.empty, Some((numuv, numpv))) // the same values were already emitted, so emit nothing
                case _ =>
                  (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
              }
          }

        // Final result
        val resultedStream = resultStream.map(x => {
          val keys = x._1.split("_")
          val date = keys(0)
          val helperversion = keys(1)
          (date, helperversion, x._2, x._3)
        })

        resultedStream.print()
        env.execute("PvUvCount")

      }
    }

The size of the List holds the PV and the size of the Set holds the UV, which gives us real-time PV and UV counts.

The key function used here:

applyWith: takes an initial state value, a foldFunction, and a windowFunction.
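
The three pieces that applyWith takes can be imitated with an ordinary foldLeft. The following is a sketch in plain Scala, not the Flink API: `zero`, `fold`, and `result` are illustrative names standing in for the initial value, the foldFunction, and the windowFunction, and the window bounds are left out.

```scala
object PvUvFoldSketch {
  // (date, version, guid), mirroring the tuple produced by the map step
  type Event = (String, String, String)
  // (key, one guid hash per record, distinct guid hashes)
  type Acc = (String, List[Int], Set[Int])

  // Plays the role of the initial state value.
  val zero: Acc = ("", List.empty[Int], Set.empty[Int])

  // Plays the role of foldFunction: one event in, updated accumulator out.
  def fold(acc: Acc, e: Event): Acc = {
    val (_, list, set) = acc
    (e._1 + "_" + e._2, e._3.hashCode +: list, set + e._3.hashCode)
  }

  // Plays the role of windowFunction: accumulator to (key, pv, uv).
  def result(acc: Acc): (String, Int, Int) = (acc._1, acc._2.size, acc._3.size)
}
```

Folding three events for the same date and version, two of which share a guid, yields a PV of 3 and a UV of 2. Because the Set holds hash codes rather than the guids themselves, two different guids with the same hash code would be counted as one visitor.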

Remaining Problems

Clearly, when the data volume is large, the List and the Set both grow very large. The PV does not actually need a List at all: it can be kept in a single state variable that is incremented for every record, which corresponds to a state update.

Improved Version

This version uses a counter to store the PV value.
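
The effect of the change can be sketched in plain Scala without Flink. This sketch swaps in a plain Long for the counter, so the PV costs a fixed amount of memory no matter how many records arrive; the names are illustrative only.

```scala
object CounterFoldSketch {
  // (date, version, guid)
  type Event = (String, String, String)
  // (key, pv counter, distinct guid hashes)
  type Acc = (String, Long, Set[Int])

  val zero: Acc = ("", 0L, Set.empty[Int])

  // Increment the counter instead of prepending to a list.
  def fold(acc: Acc, e: Event): Acc =
    (e._1 + "_" + e._2, acc._2 + 1L, acc._3 + e._3.hashCode)
}
```

Only the Set still grows with the data, and then only with the number of distinct visitors.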

    package com.ddxygq.bigdata.flink.streaming.pvuv

    import java.util.Properties

    import com.alibaba.fastjson.JSON
    import org.apache.flink.api.common.accumulators.IntCounter
    import org.apache.flink.runtime.state.filesystem.FsStateBackend
    import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala.extensions._
    import org.apache.flink.api.scala._
    import org.apache.flink.core.fs.FileSystem

    object PvUv2 {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Added to the original listing: Flink 1.7 defaults to processing time,
        // so event time must be enabled for the watermarks below to take effect
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        // Fault tolerance
        env.enableCheckpointing(5000)
        env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
        env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/streaming/counter"))

        // Kafka configuration
        val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
        val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
        val TRANSACTION_GROUP = "flink-count"
        val TOPIC_NAME = "flink"
        val kafkaProps = new Properties()
        kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
        kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
        kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

        // Watermark: maximum allowed out-of-orderness (one day, in milliseconds)
        val MaxOutOfOrderness = 86400 * 1000L

        val streamData: DataStream[(String, String, String)] = env.addSource(
          new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
        ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
          override def extractTimestamp(element: String): Long = {
            val t = JSON.parseObject(element)
            val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
            time.toLong
          }
        }).map(x => {
          // Fields fall back to "error" when the record cannot be parsed
          var date = "error"
          var guid = "error"
          var helperversion = "error"
          try {
            val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
            val datetime = messageJsonObject.getString("time")
            date = datetime.split(" ")(0)
            // hour = datetime.split(" ")(1).substring(0, 2)
            val decrypted_data_string = messageJsonObject.getString("decrypted_data")
            if (!"".equals(decrypted_data_string)) {
              val decrypted_data = JSON.parseObject(decrypted_data_string)
              guid = decrypted_data.getString("guid").trim
              helperversion = decrypted_data.getString("helperversion")
            }
          } catch {
            case e: Exception => {
              println(e)
            }
          }
          (date, helperversion, guid)
        })

        val resultStream = streamData.keyBy(x => {
          x._1 + x._2
        }).timeWindow(Time.days(1))
          .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
          .applyWith(("", new IntCounter(), Set.empty[Int], 0L, 0L))(
            foldFunction = {
              case ((_, cou, set, _, _), item) => {
                val date = item._1
                val helperversion = item._2
                val guid = item._3
                // Increment the pv counter instead of growing a list
                cou.add(1)
                (date + "_" + helperversion, cou, set + guid.hashCode, 0L, 0L)
              }
            }
            , windowFunction = {
              case (key, window, result) => {
                result.map {
                  case (leixing, cou, set, _, _) => {
                    // getLocalValue returns java.lang.Integer; unbox it to a Scala Int
                    (leixing, cou.getLocalValue.intValue, set.size, window.getStart, window.getEnd)
                  }
                }
              }
            }
          ).keyBy(0)
          .flatMapWithState[(String, Int, Int, Long, Long), (Int, Int)] {
            case ((key, numpv, numuv, begin, end), curr) =>

              curr match {
                case Some(numCurr) if numCurr == ((numuv, numpv)) =>
                  (Seq.empty, Some((numuv, numpv))) // the same values were already emitted, so emit nothing
                case _ =>
                  (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
              }
          }

        // Final result
        val resultedStream = resultStream.map(x => {
          val keys = x._1.split("_")
          val date = keys(0)
          val helperversion = keys(1)
          (date, helperversion, x._2, x._3)
        })

        val resultPath = "D:\\space\\IJ\\bigdata\\src\\main\\scala\\com\\ddxygq\\bigdata\\flink\\streaming\\pvuv\\result"
        resultedStream.writeAsText(resultPath, FileSystem.WriteMode.OVERWRITE)
        env.execute("PvUvCount")

      }
    }

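Further Improvement

This version still needs a Set to hold the UV, which inevitably puts pressure on memory. If the cluster is small and costs need to be kept down, an external store can be used instead: the uniqueness of HBase rowkeys or the Redis set data structure can both provide fast, real-time deduplication.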
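
A minimal sketch of that idea in plain Scala. `DistinctStore`, `InMemoryStore`, and `UvCounter` are hypothetical names invented for this illustration. With Redis, the store would wrap the SADD command, which reports whether the member was newly added; here an in-memory stand-in is used so the sketch runs without any external service.

```scala
import scala.collection.mutable

// Hypothetical interface over an external set store (for example Redis SADD).
trait DistinctStore {
  // Returns true only when `member` was not yet present under `key`.
  def addIfAbsent(key: String, member: String): Boolean
}

// In-memory stand-in for the external store, for illustration only.
class InMemoryStore extends DistinctStore {
  private val sets = mutable.Map.empty[String, mutable.Set[String]]
  def addIfAbsent(key: String, member: String): Boolean =
    sets.getOrElseUpdate(key, mutable.Set.empty[String]).add(member)
}

// Keeps only one number per key locally; the guids live in the store.
class UvCounter(store: DistinctStore) {
  private val uv = mutable.Map.empty[String, Long]

  // Returns the uv for `key` after observing `guid`.
  def observe(key: String, guid: String): Long = {
    val current = uv.getOrElse(key, 0L)
    val updated = if (store.addIfAbsent(key, guid)) current + 1L else current
    uv(key) = updated
    updated
  }
}
```

The job then holds a single counter per key, at the cost of one round trip to the store for each record.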


Editor: 武曉燕. Source: Java大數據與數據倉庫