Flume is a flexible, scalable, and reliable system for collecting streaming data. The Flume User Guide describes how to configure Flume, and the new Flume Cookbook contains instructions (called recipes) for common Flume use cases. In this post, we present a recipe for one such use case: using a Flume node to collect Apache 2 web server logs and deliver them to HDFS.
Using Flume Agents for Apache 2.x Web Server Logging
To connect Flume to Apache 2.x servers, you will need to:
- Configure web log file permissions
- Tail the web logs or use piped logs to enable Flume to get data from the web server
This section will step through basic setup on default Ubuntu Lucid and default CentOS 5.5 installations. Then it will describe various ways of integrating Flume.
If You are Using CentOS / Red Hat Apache Servers
By default, CentOS's Apache writes web logs to files owned by root and in group adm in 0644 (-rw-r--r--) mode. Flume is run as the flume user, and because these files are world-readable, the Flume node is able to read the logs. Apache on CentOS/Red Hat servers defaults to writing logs to two files:
The simplest way to gather data from these files is to tail the files by configuring Flume nodes to use Flume’s tail source:
If You are Using Ubuntu Apache Servers
By default, Ubuntu servers write web logs to files owned by root and in group adm in 0640 (-rw-r-----) mode. Flume is run as the flume user and by default will not be able to read the files. One way to allow the flume user to read the files is to add it to the adm group. Apache servers on Ubuntu default to writing logs to three files:
The simplest way to gather data from these files is by configuring Flume nodes to use Flume’s tail source:
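The permission difference between the two distributions can be checked mechanically before a tail source is pointed at a log file. The following is a minimal bash sketch, not part of Flume: `can_read` is a hypothetical helper that only inspects octal mode bits, and it assumes the flume user is not the file owner.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether a user could read a file, given the
# file's octal mode and how the user relates to the file.
# relation: "owner", "group" (user belongs to the file's group), or "other".
can_read() {
  local mode="$1" relation="$2"
  local bits=$(( 8#$mode ))   # interpret the mode string (e.g. 0640) as octal
  case "$relation" in
    owner) (( bits & 0400 )) ;;
    group) (( bits & 0040 )) ;;
    other) (( bits & 0004 )) ;;
    *) return 2 ;;
  esac
}

# 0644: world-readable, so the flume user can read the logs as "other".
can_read 0644 other && echo "0644: readable by flume"
# 0640: no read bit for "other", so the flume user is locked out by default.
can_read 0640 other || echo "0640: not readable by flume"
# 0640 again, once the flume user has been added to the adm group
# (typically done as root with: usermod -a -G adm flume).
can_read 0640 group && echo "0640: readable once flume is in adm"
```

On a live host the same question can be answered directly by attempting a read as the flume user; the sketch only makes the mode arithmetic explicit.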
Getting Log Entries from Piped Log Files
The Apache 2.x documentation describes using piped logging with the CustomLog directive. Its example uses the rotatelogs program to periodically write data to new files with a given prefix. Here are some example directives that could be in the httpd.conf/apache2.conf file:
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "|/usr/sbin/rotatelogs /var/log/apache2/foo_access_log 3600" common
TIP: In Ubuntu Lucid, these directives are in /etc/apache2/sites-available/default. In CentOS 5.5, these directives are in /etc/httpd/conf/httpd.conf.
These directives configure Apache to start a new log file named /var/log/apache2/foo_access_log.xxxxx every hour (3600 seconds), using the "common" log format. You can configure a Flume node to use Flume's tailDir source to read all of these files without modifying the Apache settings:
- tailDir("/var/log/apache2/", "foo_access_log.*")
The first argument is the directory, and the second is a regex that should match against the file name. tailDir will watch the directory and tail all files that have matching file names.
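To see why the pattern above picks up every rotated file, it helps to look at how rotatelogs names them: it appends the nominal start time of the rotation interval, in epoch seconds rounded down to a multiple of the rotation period. The bash sketch below reproduces that naming and tests names against the pattern. Both function names are hypothetical, the timestamp is an arbitrary sample value, and matching against the whole file name is an assumption about tailDir's behavior (bash uses POSIX extended regexes rather than Java's, which makes no difference for a pattern this simple).

```shell
#!/usr/bin/env bash
# Hypothetical helper: file name rotatelogs would use for a given base name,
# rotation period (seconds), and current time (epoch seconds).
rotated_name() {
  local base="$1" period="$2" now="$3"
  echo "${base}.$(( now - now % period ))"
}

# Hypothetical helper: whole-name regex match (anchoring is an assumption).
matches_pattern() {
  local name="$1" regex="$2"
  local re="^(${regex})\$"
  [[ "$name" =~ $re ]]
}

name=$(rotated_name foo_access_log 3600 1286000000)
echo "$name"   # foo_access_log.1285999200
matches_pattern "$name" 'foo_access_log.*' && echo "would be tailed"
matches_pattern "error_log" 'foo_access_log.*' || echo "would be ignored"
```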
Using Piped Logs
Instead of writing data to disk and then having Flume read it, you can have Flume ingest data directly from Apache. To do so, modify the web server’s parameters and use its piped log feature by adding some directives to the Apache server’s configuration:
CustomLog "|flume node_nowatch -1 -n apache -c \'apache:console|agentBESink(\"collector\");\'" common
CustomLog "|flume node_nowatch -1 -n apache -c \'apache:console|agentDFOSink(\"collector\");\'" common
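With a piped log, Apache starts the named program and writes each log entry to that program's standard input, which is what the console source in the directives above reads. The bash sketch below imitates only the receiving side so the data flow is easier to picture; `consume_piped_log` is a hypothetical stand-in for the Flume node, and the log line is sample data in the common format rather than real traffic.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the piped-log program: read one log entry per
# line from stdin and emit it as a numbered event.
consume_piped_log() {
  local count=0 line
  while IFS= read -r line; do
    count=$(( count + 1 ))
    printf 'event %d: %s\n' "$count" "$line"
  done
}

# Sample entry in the "common" log format (illustrative data only).
printf '%s\n' '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326' \
  | consume_piped_log
```

Nothing in this path touches the disk, which is the source of both the efficiency gain and the added risk of event loss.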
WARNING: By default on CentOS, the Java executable required by the Flume node is not in the root user's path. You can use alternatives to create a managed symlink in /usr/bin/ for the Java executable.
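Before enabling the piped-log directive, it can be worth confirming that java resolves on the path the Flume node will actually see. The bash sketch below is one way to do that; `java_on_path` is a hypothetical helper, and the path string passed to it is an assumed example of a minimal root path, not a value taken from CentOS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if an executable named "java" can be found
# on the given PATH string. The lookup runs in a subshell so the caller's
# PATH is left untouched.
java_on_path() {
  local search_path="$1"
  ( PATH="$search_path"; command -v java >/dev/null 2>&1 )
}

# Assumed example of a minimal root PATH; substitute the real one.
if java_on_path "/usr/bin:/bin:/usr/sbin:/sbin"; then
  echo "java found on the assumed root PATH"
else
  echo "java missing from the assumed root PATH"
fi
```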
Using piped logs can be more efficient, but it is riskier: Flume can deliver messages without first saving them to disk, which increases the probability of event loss. From a security point of view, this Flume node instance runs as Apache's user, which is often root according to the Apache manual.
NOTE: The prior examples use Flume nodes in one-shot mode, which runs without contacting a master. A one-shot node can deliver data directly to a collector, but only at the best-effort (BE) or disk-failover (DFO) level. It cannot directly use the automatic chains or the end-to-end (E2E) reliability mode, because the automatic chains are generated by the master and E2E mode delivers acknowledgements through the master.
However, you can have a one-shot Flume node deliver data to a local Flume node daemon, where the reliable E2E mode can be used. In this setup we would use the following Apache directive:
CustomLog "|flume node_nowatch -1 -n apache -c \'apache:console|agentBESink(\"localhost\", 12345);\'" common
Then you can set up a Flume node to listen with the following configuration:
node : rpcSource(12345) | agentE2ESink("collector");
Since this daemon node is connected to the master, it can also use the auto*Chains:
node : rpcSource(12345) | autoE2EChain;
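The reliability constraint on one-shot nodes can be made explicit when generating the Apache directive. The bash sketch below builds the CustomLog line used in this recipe and refuses any level other than best effort or disk failover. `one_shot_directive` is a hypothetical helper, not a Flume tool; the directive text is copied from the examples in this recipe, and the sink arguments must be passed already escaped for Apache.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a CustomLog directive that pipes Apache logs
# into a one-shot Flume node.
#   level:     BE (best effort) or DFO (disk failover). Anything else,
#              including E2E, is rejected because a one-shot node runs
#              without a master.
#   sink_args: text placed inside the sink's parentheses, already escaped
#              for Apache, e.g. \"collector\"
one_shot_directive() {
  local level="$1" sink_args="$2"
  case "$level" in
    BE|DFO) ;;
    *) echo "unsupported level for a one-shot node: $level" >&2; return 1 ;;
  esac
  printf 'CustomLog "|flume node_nowatch -1 -n apache -c %s" common\n' \
    "\\'apache:console|agent${level}Sink(${sink_args});\\'"
}

one_shot_directive BE '\"localhost\", 12345'
one_shot_directive E2E '\"collector\"' || echo "E2E must go through a daemon node"
```

The first call prints the directive for the local-daemon setup; the second fails, mirroring the rule that E2E delivery has to be configured on the daemon node instead.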
NOTE: End-to-end mode attempts to ensure delivery of data that enters the E2E sink. In this one-shot-node to reliable-node scenario, data is not safe until it reaches the E2E sink. However, since this is a local connection, it should only fail when the machine or one of the processes fails. The one-shot node can be set to disk failover (DFO) mode in order to reduce the chance of message loss if the daemon node's configuration changes.
Recently, we committed a lightweight Flume logger called flogger, implemented in C++ by Cloudera intern Dani Rayan. This utility can be used in place of the one-shot Flume node to reduce the required resource footprint.
This recipe is one of many from the growing Flume Cookbook. So far we have written recipes for collecting data from syslog services and from scribe nodes, as well as techniques for testing Flume's sources and sinks from the command line. If you have a Flume recipe you would like to share, or would like to improve some of our existing recipes, please contact us. We can add it to the Cookbook and help other users in the community! You can find us on IRC channel #flume at irc.freenode.net, on the flume-users mailing list, or meet us in person in New York at Hadoop World 2010!