Getting Started: Web Server Log Analysis
Suppose you host a popular e-commerce website. In order to understand your customers better, you want to analyze your Apache web logs to discover how people are finding your site. You’d especially like to determine which of your online ad campaigns are most successful in driving traffic to your online store.
The web server logs, however, are too large to import into a MySQL database, and they are not in a relational format. You need another way to analyze them.
Amazon EMR integrates open-source applications such as Hadoop and Hive with Amazon Web Services to provide a scalable and efficient architecture for analyzing large-scale data, such as Apache web logs.
In the following tutorial, we’ll import data from Amazon S3 and create an Amazon EMR cluster from the AWS Management Console. Then we’ll connect to the master node of the cluster, where we’ll run Hive to query the Apache logs using a simplified SQL syntax.
This tutorial typically takes less than an hour to complete. You pay only for the resources you use. The tutorial includes a cleanup step to help ensure that you don’t incur additional costs. You may also want to review the Pricing topic.
Before you begin, make sure you’ve completed the steps in Getting Set Up.
Click Next to start the tutorial.
Step 1: Create a Cluster Using the Console
This tutorial reflects changes made to the Amazon EMR console in November 2013. If your console screens do not match the images in this guide, switch to the new version by clicking the link that appears at the top of the console:
- Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
- Click Create Cluster.
- In the Cluster Configuration section, type a Cluster name or use the default value of My cluster. Set Termination protection to No and clear the Logging check box.
In a production environment, logging and debugging can be useful tools for analyzing errors or inefficiencies in Amazon EMR steps or programs. For more information on how to use logging and debugging in Amazon EMR, go to Troubleshooting in the Amazon Elastic MapReduce Developer Guide.
- In the Software Configuration section, leave the default Hadoop distribution setting: Amazon and the latest AMI version. Under Applications to be installed, leave the default Hive settings. Click the X to remove Pig from the list.
- In the Hardware Configuration section, leave the default settings. The default instance types, an m1.small master node and two m1.small core nodes, will help keep the cost of this tutorial low.
When you analyze data in a real application, you may want to increase the size or number of these nodes to improve processing power and reduce computational time. You may also want to use spot instances to further reduce your Amazon EC2 costs. For more information about spot instances, go to Lowering Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.
- In the Security and Access section, select the EC2 key pair you created in the preceding step. Leave the default IAM settings.
Leave the default Bootstrap Actions and Steps settings. Bootstrap actions and steps allow you to customize and configure your application. For this tutorial, we will be using Hive, which is already installed on the AMI, so no additional configuration is needed.
- Review the settings. If everything looks correct, click Create cluster.
A summary of your new cluster will appear, with the status STARTING. It will take a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster.
Step 2: Connect to the Master Node
When the cluster in the Amazon EMR console is WAITING, the master node is ready for you to connect to it. First you’ll need to get the DNS name of the master node and configure your connection tools and credentials.
- If you’re not currently viewing the Cluster Details page, first select the cluster on the Cluster List page.
On the Cluster Details page, you’ll see the Master public DNS name. Make a note of the DNS name; you’ll need it in the next step.
You can use secure shell (SSH) to open a terminal connection to the master node. An SSH application is installed by default on most Linux, Unix, and Mac OS installations. Windows users can use an application called PuTTY to connect to the master node. Platform-specific instructions for configuring a Windows application to open an SSH connection are provided later in this topic.
You must first configure your credentials, or SSH will return an error message saying that your private key file is unprotected, and it will reject the key. You need to do this step only the first time you use the private key to connect.
- Open a terminal window. On most computers running Mac OS X, you’ll find the terminal at Applications/Utilities/Terminal. On many Linux distributions, the path is Applications/Accessories/Terminal.
- Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permissions to access the key. For example, if you saved the file as mykeypair.pem in your home directory, you can use this command:

chmod og-rwx ~/mykeypair.pem
- In the terminal window, enter the following command, where the value of the -i parameter indicates the location of the private key file you saved in Step 2: Create a Key Pair. In this example, the key is assumed to be in your home directory.

ssh hadoop@master-public-dns-name -i ~/mykeypair.pem
- You’ll see a warning that the authenticity of the host can’t be verified. Type yes to continue connecting.
If you’re using a Windows-based computer, you’ll need to install an SSH client in order to connect to the master node. In this tutorial, we’ll use PuTTY. If you have already installed PuTTY and configured your key pair, you can skip this procedure.
- Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
- Launch PuTTYgen.
- Click Load. Select the PEM file you created earlier. You may have to change the file type search parameter from “PuTTY Private Key Files (*.ppk)” to “All Files (*.*)”.
- Click Open.
- In the PuTTYgen notice telling you the key was successfully imported, click OK.
- To save the key in the PPK format, click Save private key.
- When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
- Enter a name for your PuTTY private key, such as mykeypair.ppk.
- Start PuTTY.
- In the Category list, click Session. In the Host Name box, type hadoop@ followed by the master public DNS name you noted earlier.
- In the Category list, expand Connection, expand SSH, and then click Auth.
- In the Options controlling SSH authentication pane, click Browse for Private key file for authentication, and then select the private key file that you generated earlier. If you followed this guide, the file name is mykeypair.ppk.
- Click Open.
- To connect to the master node, click Open.
- In the PuTTY Security Alert window, click Yes.
For more information about how to install PuTTY and use it to connect to an EC2 instance, go to Connecting to Linux/UNIX Instances from Windows Using PuTTY in the Amazon Elastic Compute Cloud User Guide.
When you’ve successfully connected to the master node via SSH, you’ll see a welcome message and prompt similar to the following:
-----------------------------------------------------------------------------
Welcome to Amazon EMR running Hadoop and Debian/Lenny.

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop.
Check /mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands:

  JobTracker    lynx http://localhost:9100/
  NameNode      lynx http://localhost:9101/
-----------------------------------------------------------------------------
hadoop@ip-10-245-190-34:~$
Step 3: Start and Configure Hive
Apache Hive is a data warehouse application you can use to query Amazon EMR cluster data with a SQL-like language. Because Hive was listed in the Applications to be installed when we created the cluster, it’s ready to use on the master node.
To use Hive interactively to query the web server log data, you’ll need to load some additional libraries. The additional libraries are contained in a Java archive file named hive_contrib.jar on the master node. When you load these libraries, Hive bundles them with the MapReduce job that it launches to process your queries.
To learn more about Hive, go to http://hive.apache.org/.
- On the command line of the master node, type hive, and then press Enter.
- At the hive> prompt, type the following command, and then press Enter.

hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;
Wait for a confirmation message similar to the following:
Added /home/hadoop/hive/lib/hive_contrib.jar to class path
Added resource: /home/hadoop/hive/lib/hive_contrib.jar
Step 4: Create the Hive Table and Load Data into HDFS
In order for Hive to interact with data, it must translate the data from its current format (in the case of Apache web logs, a text file) into a format that can be represented as a database table. Hive does this translation using a serializer/deserializer (SerDe). SerDes exist for a variety of data formats. For information about how to write a custom SerDe, go to the Apache Hive Developer Guide.
The SerDe we’ll use in this example uses regular expressions to parse the log file data. It comes from the Hive open-source community and can be found at https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java. (This link is provided for reference only; for the purposes of this tutorial, you do not need to download the SerDe.)
Using this SerDe, we can define the log files as a table, which we’ll query using SQL-like statements later in this tutorial.
- Copy the following multiline command. At the hive command prompt, paste the command, and then press Enter.
CREATE TABLE serde_regex(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';
In the command, the LOCATION parameter specifies the location of a set of sample Apache log files in Amazon S3. To analyze your own Apache web server log files, you would replace the Amazon S3 URL above with the location of your own log files in Amazon S3. To meet the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-).
After you run the command above, you should receive a confirmation like this one:
Found class for org.apache.hadoop.hive.contrib.serde2.RegexSerDe
OK
Time taken: 12.56 seconds
hive>
Once Hive has loaded the data, the data will persist in HDFS storage as long as the Amazon EMR cluster is running, even if you shut down your Hive session and close the SSH connection.
Step 5: Query Hive
You’re ready to start querying the Apache log file data. Here are some sample queries to run.
Count the number of rows in the Apache web server log files.
select count(1) from serde_regex;
Return all fields from one row of log file data.
select * from serde_regex limit 1;
Count the number of requests from the host with an IP address of 192.168.1.198.
select count(1) from serde_regex where host="192.168.1.198";
To return query results, Hive translates your query into a Hadoop MapReduce job and runs it on the Amazon EMR cluster. Status messages will appear as the Hadoop job runs.
Hive SQL is a subset of SQL; if you know SQL, you’ll be able to easily create Hive queries. For more information about the query syntax, go to the Hive Language Manual.
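Building on the sample queries above, a query like the following would address the goal stated in the introduction: seeing which referring pages, such as ad campaign landing pages, drive the most traffic. This is an untested sketch against the serde_regex table defined in Step 4; adjust it to your own data.

select referer, count(1) as request_count
from serde_regex
group by referer
order by request_count desc
limit 10;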
Step 6: Clean Up
To prevent your account from accruing additional charges, you should terminate the cluster when you are done with this tutorial. Because you used the cluster interactively, it has to be manually terminated.
- In your SSH window or client, press CTRL+C to exit Hive.
- At the SSH command prompt, type exit, and then press Enter. You can then close the terminal or PuTTY window.
- If you are not already viewing the cluster list, click Cluster List at the top of the Amazon Elastic MapReduce console.
- In the cluster list, select the box to the left of the cluster name, and then click Terminate. In the confirmation pop-up that appears, click Terminate.
The next step is optional. It deletes the key pair you created earlier. You are not charged for key pairs. If you are planning to explore Amazon EMR further or complete the other tutorial in this guide, you should retain the key pair.
- In the Amazon EC2 console navigation pane, select Key Pairs.
- In the content pane, select the key pair you created, then click Delete.
The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the cluster. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.
- In the Amazon EC2 console navigation pane, click Security Groups.
- In the content pane, click the ElasticMapReduce-slave security group.
- In the details pane for the ElasticMapReduce-slave security group, click the Inbound tab. Delete all rules that reference ElasticMapReduce. Click Apply Rule Changes.
- In the content pane, click ElasticMapReduce-slave, and then click Delete. Click Yes, Delete to confirm. (This group must be deleted before you can delete the ElasticMapReduce-master group.)
- In the content pane, click ElasticMapReduce-master, and then click Delete. Click Yes, Delete to confirm.