一文帶你了解大數(shù)據(jù)基石-Hadoop

作者：程哥編程 2024-05-07 08:49:36

基于存儲(chǔ)以及計(jì)算Hadoop量大兩大功能模塊-分布式存儲(chǔ)HDFS以及分布式計(jì)算MapReduce，下面分別針對(duì)這兩大功能模塊詳細(xì)介紹。

當(dāng)前的互聯(lián)網(wǎng)的時(shí)代，信息爆炸的年代，抓住了風(fēng)口那么距離成功也就走了一半啦！這個(gè)風(fēng)口如何抓住我不知道，但是如何分析用戶(hù)的喜好以及其他行為卻是唾手可得的，用戶(hù)的行為如何存儲(chǔ)如何分析就是本文的下面要講的知識(shí)點(diǎn)。

那么為什么要用到本文提到的hadoop組件，這里啰嗦兩句，因?yàn)樾畔⒈ū厝粫?huì)帶來(lái)海量的數(shù)據(jù)，那么單機(jī)服務(wù)器勢(shì)必會(huì)造成存儲(chǔ)以及計(jì)算瓶頸，那么hadoop組件就是在做這兩件事情的。

基于存儲(chǔ)以及計(jì)算hadoop量大兩大功能模塊-分布式存儲(chǔ)HDFS以及分布式計(jì)算MapReduce，下面分別針對(duì)這兩大功能模塊詳細(xì)介紹。

hadoop之分布式存儲(chǔ)HDFS

首先呢，這個(gè)HDFS的設(shè)計(jì)靈感來(lái)自google的GFS論文，設(shè)計(jì)的目的就是應(yīng)付海量的數(shù)據(jù)存儲(chǔ)（PB|TB）

HDFS有如下特點(diǎn)：

HDFS適合處理大規(guī)模數(shù)據(jù)，如：TB和PB，可以處理百萬(wàn)規(guī)模以上的文件數(shù)量，使用場(chǎng)景是一次寫(xiě)入、多次讀取場(chǎng)景。
HDFS將文件線性按字節(jié)切分成多個(gè)block塊進(jìn)行存儲(chǔ)，每個(gè)block塊默認(rèn)128M。
每個(gè)block塊默認(rèn)有3個(gè)副本，提高容錯(cuò)性，如果一個(gè)副本丟失不可用，后續(xù)可以自動(dòng)恢復(fù)。
HDFS適合大文件寫(xiě)入，不適合大量小文件寫(xiě)入，因?yàn)樾∥募郚ameNode要使用更多內(nèi)存來(lái)維護(hù)存儲(chǔ)文件目錄和block信息。此外，讀取大量小文件時(shí)，文件尋址時(shí)間要大于文件讀取時(shí)間，違反HDFS設(shè)計(jì)目標(biāo)。
HDFS不支持并發(fā)寫(xiě)入數(shù)據(jù)，一個(gè)文件只能有一個(gè)寫(xiě)，不能多個(gè)線程同時(shí)寫(xiě)。
HDFS數(shù)據(jù)寫(xiě)入后不支持修改，只支持append追加。

HDFS是一個(gè)主從（Master/Slaves）架構(gòu)，由一個(gè)NameNode和一些DataNode組成，下圖是HDFS架構(gòu)：

HDFS 架構(gòu)圖

從上圖看NameNode節(jié)點(diǎn)存儲(chǔ)所有文件的與數(shù)據(jù)信息以及地址信息充當(dāng)著目錄索引的作用，SecondaryNameNode 節(jié)點(diǎn)則可以認(rèn)為是NameNode的預(yù)備節(jié)點(diǎn)，DataNode節(jié)點(diǎn)則負(fù)責(zé)著文件以及文件副本的保存，正是有著副本以及Secondary NameNode節(jié)點(diǎn)的存在，保障了整個(gè)系統(tǒng)的高可用，下面則有一個(gè)簡(jiǎn)單的連接HDFS的例子。

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class HdfsExample {
  
  public static void main(String[] args) {
    try {
      // 創(chuàng)建Hadoop配置對(duì)象
      Configuration conf = new Configuration();
      
      // 獲取Hadoop文件系統(tǒng)實(shí)例
      FileSystem fs = FileSystem.get(conf);
      
      // 定義要操作的文件路徑
      String hdfsPath = "/user/hadoop/sample.txt";
      Path path = new Path(hdfsPath);
      
      // 檢查文件是否存在
      boolean exists = fs.exists(path);
      System.out.println("文件是否存在：" + exists);
      
      // 在HDFS上創(chuàng)建一個(gè)新文件
      if (!exists) {
        OutputStream os = fs.create(path);
        System.out.println("文件創(chuàng)建成功");
        os.close();
      }
      
      // 將本地文件上傳到HDFS
      String localFilePath = "/path/to/local/file.txt";
      Path localPath = new Path(localFilePath);
      fs.copyFromLocalFile(localPath, path);
      System.out.println("文件上傳成功");
      
      // 從HDFS中讀取文件內(nèi)容
      InputStream is = fs.open(path);
      byte[] buffer = new byte[1024];
      int bytesRead = is.read(buffer);
      while (bytesRead > 0) {
        System.out.println(new String(buffer, 0, bytesRead));
        bytesRead = is.read(buffer);
      }
      is.close();
      
      // 刪除HDFS上的文件
      boolean deleted = fs.delete(path, false);
      System.out.println("文件是否刪除成功：" + deleted);
      
      // 關(guān)閉Hadoop文件系統(tǒng)實(shí)例
      fs.close();
      
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

hadoop之分布式計(jì)算之MapReduce

此功能的靈感同樣是來(lái)自google的同名論文（牛逼的永遠(yuǎn)是寫(xiě)論文的呀）。

此功能模塊的牛逼之處就在于它的編程思想，那么就已worldcount實(shí)例簡(jiǎn)單講下。

假設(shè)現(xiàn)在有兩個(gè)文件，數(shù)據(jù)如下，假如我們要讀取文件中的數(shù)據(jù)進(jìn)行wordcount統(tǒng)計(jì)，那么需要進(jìn) 行如下步驟。

以上過(guò)程演示的就是MapReduce處理數(shù)據(jù)的大體流程，MapReduce模型由兩個(gè)主要階段組成： Map階段和Reduce階段：

Map階段：

在Map階段中，輸入數(shù)據(jù)被分割成若干個(gè)獨(dú)立的塊，并由多個(gè)Mapper任務(wù)并行處理，每個(gè)Mapper 任務(wù)都會(huì)執(zhí)行用戶(hù)定義的map函數(shù)，將輸入數(shù)據(jù)轉(zhuǎn)換成一系列鍵-值對(duì)的形式（Key-Value Pairs），這些鍵-值對(duì)被中間存儲(chǔ)，以供Reduce階段使用。 Map階段主要是對(duì)數(shù)據(jù)進(jìn)行映射變換，讀取一條數(shù)據(jù)可以返回一條或者多條K,V格式數(shù)據(jù)。

Reduce階段：

在Reduce階段中，所有具有相同鍵的鍵-值對(duì)會(huì)被分配到同一個(gè)Reducer任務(wù)上，Reducer任務(wù)會(huì)執(zhí) 行用戶(hù)定義的reduce函數(shù)，對(duì)相同鍵的值進(jìn)行聚合、匯總或其他操作，生成最終的輸出結(jié)果， Reduce階段也可以由多個(gè)Reduce Task并行執(zhí)行。 Reduce階段主要對(duì)相同key的數(shù)據(jù)進(jìn)行聚合，最終對(duì)相同key的數(shù)據(jù)生成一個(gè)結(jié)果，最終寫(xiě)出到磁盤(pán) 文件中。

下面就是一個(gè)簡(jiǎn)單的MapReduce代碼示例：

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

本文到此就結(jié)束了，當(dāng)然hadoop肯定不止這樣簡(jiǎn)單的內(nèi)容，感興趣的小伙伴可以到apache官網(wǎng)學(xué)習(xí)下，畢竟這個(gè)是大廠大數(shù)據(jù)必會(huì)的東西。

責(zé)任編輯：姜華來(lái)源：今日頭條