[Repost] FTP to HDFS


An FTP client that stores content directly into HDFS is useful: data from FTP servers can be written straight into HDFS instead of first being copied locally and then uploaded. The benefit is apparent from an administrative perspective, as large datasets can be pulled from FTP servers with minimal human intervention.

This greatly simplifies pulling data from FTP servers into HDFS, and also makes it faster, since it eliminates one hop through the local file system.

At present we face the issue of our data lying on several different remote FTP servers.

This utility essentially provides the following benefits:
1. The steps ‘pull data from FTP server’, ‘store locally’, ‘transfer to HDFS’ and ‘delete local copy’ are collapsed into one step: ‘pull data and store into HDFS’.
2. No need to worry about a lack of local storage, as data goes directly into HDFS.
3. It can be used to run a batch of commands that pull data from different FTP servers.
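For the batch case, the same one-step pull can also be scripted with Hadoop's stock distcp tool, which accepts ftp:// source URIs through the same FTPFileSystem. A minimal sketch, assuming the hadoop command is on the PATH; all hostnames, credentials and paths below are placeholders, not real endpoints:

```shell
#!/bin/sh
# Pull datasets from several FTP servers straight into HDFS, one pass each.
# Hosts, credentials and paths are placeholders for illustration only.
hadoop distcp ftp://user1:pass1@ftp1.example.com/exports/day1 hdfs:///data/ftp1/day1
hadoop distcp ftp://user2:pass2@ftp2.example.com/exports/day1 hdfs:///data/ftp2/day1
```

Each line is one ‘pull data and store into HDFS’ step; no local staging is involved.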

All of this greatly simplifies administrative tasks.

Thanks to Ankur for fixing issue HADOOP-3246.

The following program does the job:


import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ftp.FTPFileSystem;
import org.apache.hadoop.io.IOUtils;

public class FTPtoHDFS {

    public static void main(String[] args) throws IOException, URISyntaxException {

        // Source file on the FTP server
        String src = "test1.txt";
        Configuration conf = new Configuration();

        // Connect to the FTP server through Hadoop's FTPFileSystem
        FTPFileSystem ftpfs = new FTPFileSystem();
        ftpfs.initialize(new URI("ftp://username:password@host"), conf);

        // Open the remote file with a 1000-byte buffer
        FSDataInputStream fsdin = ftpfs.open(new Path(src), 1000);

        // Destination: the default file system (HDFS); target path comes from args[0]
        FileSystem fileSystem = FileSystem.get(conf);
        OutputStream outputStream = fileSystem.create(new Path(args[0]));

        // Stream the bytes from FTP straight into HDFS; 'true' closes both streams
        IOUtils.copyBytes(fsdin, outputStream, conf, true);
    }
}
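To try the program, the class can be compiled against the Hadoop jars and run through the hadoop launcher, which puts the cluster configuration on the classpath. A sketch, assuming a reasonably recent Hadoop installation; the destination path is a placeholder:

```shell
# Compile against the Hadoop jars on this machine
javac -cp "$(hadoop classpath)" FTPtoHDFS.java

# Run via the hadoop launcher; args[0] is the HDFS destination path (placeholder)
HADOOP_CLASSPATH=. hadoop FTPtoHDFS /user/hadoop/test1.txt
```

The hadoop CLASSNAME form runs an arbitrary class with the Hadoop environment set up, so the program picks up the cluster's default file system from the configuration.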