An FTP client that stores content directly into HDFS is a useful tool. It lets data from FTP servers be written straight into HDFS instead of first being copied to local disk and then uploaded. The benefit is clearest from an administrative perspective: large datasets can be pulled from FTP servers with minimal human intervention.
At present, our data is spread across several remote FTP servers.
This utility provides the following benefits:
1. The steps ‘pull data from FTP server’, ‘store locally’, ‘transfer to HDFS’ and ‘delete local copy’ are collapsed into one step: ‘pull data and store into HDFS’.
2. No need to worry about lack of local storage as data goes directly into HDFS.
3. Can be used to run a batch of commands that include pulling data from different FTP servers.
All of this greatly simplifies administrative tasks.
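To make the first two benefits concrete, here is a minimal sketch of a streaming copy in plain Java. It is not the Hadoop implementation. The class name `StreamCopySketch`, the `copy` helper and the in-memory streams are assumptions made for illustration, standing in for the FTP download stream and the HDFS output stream. The point is that data moves through a small fixed-size buffer, so nothing has to be staged on local disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class StreamCopySketch {

    // Copies everything from in to out through a fixed-size buffer and
    // returns the number of bytes copied. Neither stream is closed here.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins (assumptions for this sketch): the source plays
        // the FTP download stream, the sink plays the HDFS output stream.
        byte[] payload = "example payload from a remote server".getBytes(StandardCharsets.UTF_8);
        try (InputStream source = new ByteArrayInputStream(payload);
             ByteArrayOutputStream sink = new ByteArrayOutputStream()) {
            long copied = copy(source, sink, 8);
            System.out.println("copied " + copied + " bytes");
        }
    }
}
```

Memory use is bounded by the buffer size regardless of how large the remote file is, which is why local storage stops being a constraint.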
The following program does the job:
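import java.io.*;
import java.net.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.fs.ftp.FTPFileSystem;
import org.apache.hadoop.io.IOUtils;
public class FTPtoHDFS {
    public static void main(String[] args) throws IOException, URISyntaxException { // args (assumed order): FTP source path, HDFS destination path
        Configuration conf = new Configuration();
        FTPFileSystem ftpfs = new FTPFileSystem();
        ftpfs.initialize(new URI("ftp://username:password@host"), conf); // host and credentials are taken from the URI
        FSDataInputStream fsdin = ftpfs.open(new Path(args[0]), 1000); // open the remote file with a 1000-byte buffer
        OutputStream outputStream = FileSystem.get(conf).create(new Path(args[1])); // default file system; HDFS when fs.defaultFS points at the cluster
        IOUtils.copyBytes(fsdin, outputStream, conf, true); // stream the bytes across; true closes both streams when done
    }
}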
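To pull from several FTP servers in one run, as the third benefit in the list suggests, the program could be driven by a list of transfers. The sketch below is one possible approach and is not part of the original utility. The batch format (`<ftp-uri> <hdfs-path>` per line), the class `FtpBatchSketch`, the `Job` type and the example hosts are all hypothetical. It uses only `java.net.URI` to split each source into host, credentials and remote path; each resulting job could then be handed to a copy routine like the one above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FtpBatchSketch {

    // One transfer: where to fetch from and where to write in HDFS.
    static final class Job {
        final URI source;      // e.g. ftp://user:password@host/path/on/server
        final String hdfsDest; // destination path in HDFS

        Job(URI source, String hdfsDest) {
            this.source = source;
            this.hdfsDest = hdfsDest;
        }
    }

    // Parses lines of the (hypothetical) form "<ftp-uri> <hdfs-path>".
    // Blank lines and lines starting with '#' are skipped.
    static List<Job> parse(List<String> lines) throws URISyntaxException {
        List<Job> jobs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected '<ftp-uri> <hdfs-path>': " + line);
            }
            URI source = new URI(parts[0]);
            if (!"ftp".equalsIgnoreCase(source.getScheme())) {
                throw new IllegalArgumentException("Not an ftp:// URI: " + parts[0]);
            }
            jobs.add(new Job(source, parts[1]));
        }
        return jobs;
    }

    public static void main(String[] args) throws URISyntaxException {
        // Example batch with made-up hosts and credentials.
        List<String> batch = Arrays.asList(
            "# nightly pulls",
            "ftp://alice:secret@ftp1.example.com/logs/a.log /data/logs/a.log",
            "ftp://bob:secret@ftp2.example.com/dumps/b.csv /data/dumps/b.csv");
        for (Job job : parse(batch)) {
            System.out.println(job.source.getHost() + job.source.getPath() + " -> " + job.hdfsDest);
        }
    }
}
```

Keeping the parsing separate from the transfer means a malformed line is rejected before any connection is opened.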