Connecting to HDFS with the Java and Python APIs

Author: 小sen, 2021-04-14 08:51:55

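The previous post covered basic HDFS operations. This one moves on to working with HDFS through the Java and Python APIs; a later post may cover Scala.

Before getting to the Java API, a word on the IDE: I use IntelliJ IDEA, specifically the 2020.3 x64 Community Edition.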

Java API

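Start by creating a Maven project. For the Maven configuration in IDEA, I point the download source (mirror) at Aliyun, since the default central repository can be very slow from some networks.

The mirror is set in the corresponding settings file, D:\apache-maven-3.8.1-bin\apache-maven-3.8.1\conf\settings.xml.

With the Maven project created, add the usual dependencies.

Add the hadoop-client dependency (listed here together with hadoop-common and hadoop-hdfs), ideally at the same version as the Hadoop installation on the cluster, plus the junit dependency for unit tests.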

<dependencies> 
  <dependency> 
        <groupId>org.apache.hadoop</groupId> 
        <artifactId>hadoop-common</artifactId> 
        <version>3.1.4</version> 
  </dependency> 
  <dependency> 
        <groupId>org.apache.hadoop</groupId> 
        <artifactId>hadoop-hdfs</artifactId> 
        <version>3.1.4</version> 
  </dependency> 
  <dependency> 
      <groupId>org.apache.hadoop</groupId> 
      <artifactId>hadoop-client</artifactId> 
      <version>3.1.4</version> 
  </dependency> 
  <dependency> 
      <groupId>junit</groupId> 
      <artifactId>junit</artifactId> 
      <version>4.11</version> 
  </dependency> 
</dependencies>

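HDFS file upload

A test class is enough here. Create a new Java file, main.java.

Without further configuration, FileSystem refers to the local file system, so it has to be initialized against the HDFS address to get the HDFS file system.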

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.junit.Test; 
import java.net.URI; 
public class main { 
 
    @Test 
    public void testPut() throws Exception { 
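        // There are several ways to obtain a FileSystem; only one is shown here (passing a URI is the most common). 
        Configuration configuration = new Configuration(); 
        // The third argument, "hadoop", is the user account on the Hadoop cluster; port 9000 must match fs.defaultFS in the cluster's core-site.xml 
        FileSystem fileSystem = FileSystem.get( 
                new URI("hdfs://192.168.147.128:9000"), 
                configuration, 
                "hadoop"); 
        // Upload f:/stopword.txt to /user/stopword.txt 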
        fileSystem.copyFromLocalFile( 
                new Path("f:/stopword.txt"), new Path("/user/stopword.txt")); 
        fileSystem.close(); 
    } 
}

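After the test runs, the stop-word file I just uploaded (a stop-word list used for machine learning) shows up in HDFS at /user/stopword.txt.

HDFS file download

Because the FileSystem has to be initialized for every test, I am lazy and put that setup in a JUnit @Before method in my own project. The listing here repeats the setup inline so that it can be read on its own.

The API call for downloading a file from HDFS is copyToLocalFile. The code is as follows.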

@Test 
public void testDownload() throws Exception { 
    Configuration configuration = new Configuration(); 
    FileSystem fileSystem = FileSystem.get( 
            new URI("hdfs://192.168.147.128:9000"), 
            configuration, 
            "hadoop"); 
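    // delSrc=false keeps the source file in HDFS; the final true selects the 
    // raw local file system, so no .crc checksum file is written locally. 
    fileSystem.copyToLocalFile( 
            false, 
            new Path("/user/stopword.txt"), 
            new Path("stop.txt"), 
            true); 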
    fileSystem.close(); 
    System.out.println("over"); 
}

Python API

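This section covers the hdfs library (HdfsCLI). Reference: https://hdfscli.readthedocs.io/

Install the library with pip install hdfs. Before using it, I ran hadoop fs -chmod -R 777 / to grant read, write, and execute permissions on the root directory and everything beneath it, which is reasonable only on a test cluster.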
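The port rule used in the session that follows (50070 on Hadoop 2.x, 9870 on Hadoop 3.x) can be captured in a small helper. This is only a sketch: `webhdfs_url` and `make_client` are hypothetical names, not part of the hdfs library, and the ports are the defaults, which a cluster may override. `make_client` uses the library's `InsecureClient` with an explicit `user`; assuming that user owns the directories being written to, this may make the blanket chmod unnecessary.

```python
# Default NameNode web (WebHDFS) ports by Hadoop major version.
# Assumption: the cluster has not overridden these defaults.
DEFAULT_WEB_PORTS = {2: 50070, 3: 9870}


def webhdfs_url(host, hadoop_major=3):
    """Build the NameNode web URL, e.g. http://host:9870 for Hadoop 3.x."""
    try:
        port = DEFAULT_WEB_PORTS[hadoop_major]
    except KeyError:
        raise ValueError(f"unsupported Hadoop major version: {hadoop_major}")
    return f"http://{host}:{port}"


def make_client(host, hadoop_major=3, user="hadoop"):
    """Return an hdfs InsecureClient acting as `user` (requires `pip install hdfs`)."""
    # Imported lazily so that webhdfs_url works even without the library installed.
    from hdfs import InsecureClient
    return InsecureClient(webhdfs_url(host, hadoop_major), user=user)


print(webhdfs_url("192.168.147.128"))     # http://192.168.147.128:9870
print(webhdfs_url("192.168.147.128", 2))  # http://192.168.147.128:50070
```

With a reachable cluster, `make_client("192.168.147.128")` would return a client that can be used like the `client` object in the session.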

>>> from hdfs.client import Client 
>>> # WebHDFS port: 50070 on Hadoop 2.x, 9870 on Hadoop 3.x 
>>> client = Client('http://192.168.147.128:9870')   
>>> client.list('/')   # list the contents of the HDFS root directory 
['hadoop-3.1.4.tar.gz'] 
>>> client.makedirs('/test') 
>>> client.list('/') 
['hadoop-3.1.4.tar.gz', 'test'] 
>>> client.delete("/test") 
True 
>>> client.download('/hadoop-3.1.4.tar.gz','C:\\Users\\YIUYE\\Desktop') 
'C:\\Users\\YIUYE\\Desktop\\hadoop-3.1.4.tar.gz' 
>>> client.upload('/','C:\\Users\\YIUYE\\Desktop\\demo.txt') 
'/demo.txt' 
>>> client.list('/') 
['demo.txt', 'hadoop-3.1.4.tar.gz'] 
>>> # demo.txt was uploaded with the content: Hello \n hdfs 
>>> with client.read("/demo.txt") as reader: 
...          print(reader.read()) 
b'Hello \r\nhdfs\r\n'
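The read above returns bytes, including the \r\n line endings (presumably because demo.txt was created on Windows). A minimal sketch of turning that result into text lines, using the literal bytes from the session rather than a live cluster:

```python
# The bytes printed by reader.read() in the session above.
raw = b'Hello \r\nhdfs\r\n'

# Decode bytes to str; assumes the file was saved as UTF-8.
text = raw.decode('utf-8')

# splitlines() splits on \r\n as well as \n and drops the line endings.
lines = text.splitlines()

print(lines)  # ['Hello ', 'hdfs']

# According to the hdfs library documentation, read() also accepts an
# encoding argument, so the reader can return str directly, e.g.:
#     with client.read('/demo.txt', encoding='utf-8') as reader:
#         text = reader.read()
```

The trailing space in 'Hello ' comes from the file itself; call strip() on each line if it is unwanted.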