
November 23, 2012

original:http://sujitpal.blogspot.com/2011/01/exploring-nutch-20-hbase-storage.html

According to the Nutch2Roadmap Wiki Page, one of the features of Nutch 2.0 (as yet unreleased, but available in SVN) is Storage Abstraction. Instead of segment files, it can use MySQL or HBase as its backend datastore (support for Cassandra is also planned).

Support for multiple backends is achieved using GORA, an ORM framework (originally written for Nutch) that works against column databases. So changing backends would probably mean adding the appropriate GORA implementation JAR to Nutch’s classpath (I haven’t looked at the GORA code yet).

Currently, even though the code is pre-release, there is a working HBase backend, and adequate documentation on how to set it up. Since we use Cassandra as part of our crawl/indexing infrastructure, I figured it would be worth checking out, so once Nutch 2.0 is out, maybe we could use it with the Cassandra backend.

So this post is basically an attempt to figure out what Nutch does to the HBase datastore as each of its subcommands is run. You can find the list of subcommands here.

The first step is to download the Nutch 2.0 and GORA sources, and build them. This page has detailed instructions, which I followed almost to the letter. The only thing to remember is to set the GORA backend in conf/nutch-site.xml after generating the Nutch runtime.

Two other changes are to set http.agent.name and http.robots.agents in nutch-default.xml (so Nutch actually does the crawl), and to set hbase.rootdir in hbase-default.xml to something other than /tmp (to prevent data loss across system restarts).
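As a rough illustration of the three settings just mentioned, the sketch below renders them in the Hadoop-style configuration XML layout (a list of name/value properties) that the Nutch and HBase config files use. The property names come from the text above; the values (MyTestCrawler, file:///var/hbase) are made-up placeholders, and the helper name render_properties is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def render_properties(props):
    """Render a dict as a Hadoop-style <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # minidom is used only to pretty-print the tree.
    return minidom.parseString(ET.tostring(root)).toprettyxml(indent="  ")


# Placeholder values (assumptions): substitute your own agent name and data directory.
nutch_props = {
    "http.agent.name": "MyTestCrawler",
    "http.robots.agents": "MyTestCrawler,*",
}
hbase_props = {"hbase.rootdir": "file:///var/hbase"}

print(render_properties(nutch_props))
print(render_properties(hbase_props))
```

The printed property elements would then be pasted inside the existing configuration element of the respective file rather than replacing the file wholesale.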

I just ran a subset of Nutch commands (we use Nutch for crawling, not its indexing and search functionality), and looked at what happened in the HBase datastore as a result. The attempt was to understand what each Nutch command does and correlate it to the code, so I can write similar code to hook into various phases of the Nutch lifecycle.

First, we have to start up HBase so Nutch can write to it. Part of the Nutch/GORA integration instructions was to install HBase, so now we can start up a local instance, and then log in to the HBase shell.

    sujit@cyclone:~$ cd /opt/hbase-0.20.6
    sujit@cyclone:hbase-0.20.6$ bin/start-hbase.sh
    localhost: starting zookeeper, logging to /opt/hbase-0.20.6/bin/../logs/hbase-sujit-zookeeper-cyclone.hl.local.out
    starting master, logging to /opt/hbase-0.20.6/bin/../logs/hbase-sujit-master-cyclone.hl.local.out
    localhost: starting regionserver, logging to /opt/hbase-0.20.6/bin/../logs/hbase-sujit-regionserver-cyclone.hl.local.out
    sujit@cyclone:hbase-0.20.6$ bin/hbase shell
    HBase Shell; enter 'help' for list of supported commands.
    Version: 0.20.6, r965666, Mon Jul 19 15:48:07 PDT 2010
    hbase(main):001:0> list
    0 row(s) in 0.1090 seconds
    hbase(main):002:0>

We use a single URL (this blog) as the seed URL. So we create a one-line file as shown below:

    http://sujitpal.blogspot.com/

and then inject this URL into HBase:

    sujit@cyclone:local$ bin/nutch inject /tmp/seed.txt

This results in a single table called “webpage” being created in HBase, with the following structure. I used list to list the tables, and scan to list the contents of the table. For ease of understanding, I reformatted the output manually into a JSON-like structure. Each leaf-level column (a cell, in HBase-speak) consists of a (key, timestamp, value) triplet, so we could have written the first leaf more compactly as {fi : "\x00'\x8D\x00"}.

It might help to refer to the conf/gora-hbase-mapping.xml file in your Nutch runtime as you read this. If you haven’t set up Nutch 2.0 locally, then this information is also available in the GORA_HBase wiki page.

    webpage : {
      key : "com.blogspot.sujitpal:http/",
      f : {
        fi : { timestamp : 1293676557658, value : "\x00'\x8D\x00" },
        ts : { timestamp : 1293676557658, value : "\x00\x00\x01-5!\x9D\xE5" }
      },
      mk : {
        _injmrk_ : { timestamp : 1293676557658, value : "y" }
      },
      mtdt : {
        _csh_ : { timestamp : 1293676557658, value : "x80\x00\x00" }
      },
      s : {
        s : { timestamp : 1293676557658, value : "x80\x00\x00" }
      }
    }
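The values above are printed the way the HBase shell shows raw bytes: printable characters as themselves and everything else as \xNN escapes. Here is a minimal sketch of turning such a string back into a number, assuming the bytes hold a big-endian integer. The helper names (shell_to_bytes, as_int, millis_to_utc) are hypothetical, and the decoder does not try to handle a literal backslash in the data.

```python
import re
from datetime import datetime, timezone

# Matches one \xNN escape as printed by the HBase shell.
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")


def shell_to_bytes(text):
    """Turn shell-style text (printable chars plus \\xNN escapes) into raw bytes."""
    raw = text.encode("ascii")
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def as_int(text):
    """Interpret the decoded bytes as one big-endian unsigned integer (an assumption)."""
    return int.from_bytes(shell_to_bytes(text), "big")


def millis_to_utc(ms):
    """Convert epoch milliseconds, like the cell timestamps, to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


fi = as_int(r"\x00'\x8D\x00")
ts = as_int(r"\x00\x00\x01-5!\x9D\xE5")
print(fi, "seconds =", fi // 86400, "days")  # 2592000 seconds = 30 days
print(ts, "->", millis_to_utc(ts))
print("cell written at", millis_to_utc(1293676557658))
```

Read this way, f:fi comes out as 2592000 (30 days expressed in seconds) and f:ts as 1293676551653, a millisecond timestamp about six seconds earlier than the cell timestamps. That would be consistent with a fetch interval and a fetch time, but conf/gora-hbase-mapping.xml is the place to confirm what each qualifier actually maps to.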

I then run the generate command, which generates the fetchlist:

    sujit@cyclone:local$ bin/nutch generate
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: true
    GeneratorJob: done
    GeneratorJob: generated batch id: 1293732622-2092819984

This creates an additional column “mk:_gnmrk_” containing the batch id, in the webpage table for the record keyed by the seed URL.

    webpage : {
      key : "com.blogspot.sujitpal:http/",
      f : {
        fi : { timestamp : 1293676557658, value : "\x00'\x8D\x00" },
        ts : { timestamp : 1293676557658, value : "\x00\x00\x01-5!\x9D\xE5" }
      },
      mk : {
        _injmrk_ : { timestamp : 1293676557658, value : "y" },
        _gnmrk_ : { timestamp : 1293732629430, value : "1293732622-2092819984" }
      },
      mtdt : {
        _csh_ : { timestamp : 1293676557658, value : "x80\x00\x00" }
      },
      s : {
        s : { timestamp : 1293676557658, value : "x80\x00\x00" }
      }
    }

Next I ran a fetch with the batch id returned by the generate command:

    sujit@cyclone:local$ bin/nutch fetch 1293732622-2092819984
    FetcherJob: starting
    FetcherJob : timelimit set for : -1
    FetcherJob: threads: 10
    FetcherJob: parsing: false
    FetcherJob: resuming: false
    FetcherJob: batchId: 1293732622-2092819984
    Using queue mode : byHost
    Fetcher: threads: 10
    QueueFeeder finished: total 1 records. Hit by time limit :0
    fetching http://sujitpal.blogspot.com/
    -finishing thread FetcherThread1, activeThreads=1
    -finishing thread FetcherThread2, activeThreads=1
    -finishing thread FetcherThread3, activeThreads=1
    -finishing thread FetcherThread4, activeThreads=1
    -finishing thread FetcherThread5, activeThreads=1
    -finishing thread FetcherThread6, activeThreads=1
    -finishing thread FetcherThread7, activeThreads=1
    -finishing thread FetcherThread8, activeThreads=1
    -finishing thread FetcherThread9, activeThreads=1
    -finishing thread FetcherThread0, activeThreads=0
    -activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
    -activeThreads=0
    FetcherJob: done

This creates some more columns as shown below. As you can see, it creates additional columns under the “f” column family, most notably the raw page content in the “f:cnt” column and a new “h” column family with page header information. It also creates a batch id marker in the “mk” column family.

    webpage : {
      key : "com.blogspot.sujitpal:http/",
      f : {
        bas : { timestamp : 1293732801833, value : "http://sujitpal.blogspot.com/" },
        cnt : { timestamp : 1293732801833, value : "DOCTYPE html PUBLIC "-//W3C//DTD X...rest of page content" },
        fi : { timestamp : 1293676557658, value : "\x00'\x8D\x00" },
        prot : { timestamp : 1293732801833, value : "x02\x00\x00" },
        st : { timestamp : 1293732801833, value : "x00\x00\x00\x02" },
        ts : { timestamp : 1293676557658, value : "\x00\x00\x01-5!\x9D\xE5" },
        typ : { timestamp : 1293732801833, value : "application/xhtml+xml" }
      },
      h : {
        Cache-Control : { timestamp : 1293732801833, value : "private" },
        Content-Type : { timestamp : 1293732801833, value : "text/html; charset=UTF-8" },
        Date : { timestamp : 1293732801833, value : "Thu, 30 Dec 2010 18:13:21 GMT" },
        ETag : { timestamp : 1293732801833, value : "40bdf8b9-8c0a-477e-9ee4-b19995601dde" },
        Expires : { timestamp : 1293732801833, value : "Thu, 30 Dec 2010 18:13:21 GMT" },
        Last-Modified : { timestamp : 1293732801833, value : "Thu, 30 Dec 2010 15:01:20 GMT" },
        Server : { timestamp : 1293732801833, value : "GSE" },
        Set-Cookie : { timestamp : 1293732801833, value : "blogger_TID=130c0c57a66d0704;HttpOnly" },
        X-Content-Type-Options : { timestamp : 1293732801833, value : "nosniff" },
        X-XSS-Protection : { timestamp : 1293732801833, value : "1; mode=block" }
      },
      mk : {
        _injmrk_ : { timestamp : 1293676557658, value : "y" },
        _gnmrk_ : { timestamp : 1293732629430, value : "1293732622-2092819984" },
        _ftcmrk_ : { timestamp : 1293732801833, value : "1293732622-2092819984" }
      },
      mtdt : {
        _csh_ : { timestamp : 1293676557658, value : "x80\x00\x00" }
      },
      s : {
        s : { timestamp : 1293676557658, value : "x80\x00\x00" }
      }
    }
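Since each step so far has left its own marker in the “mk” column family, the markers give a quick way to see how far a row has progressed. The sketch below infers the last completed stage from the marker qualifiers that appear in these listings (including the parse marker that shows up in the next step). It only mirrors what the listings show and is not how Nutch itself decides what to process; STAGE_MARKERS and last_completed_stage are hypothetical names.

```python
# Marker qualifiers in the "mk" family, in the order the listings show them appearing.
STAGE_MARKERS = [
    ("inject", "_injmrk_"),
    ("generate", "_gnmrk_"),
    ("fetch", "_ftcmrk_"),
    ("parse", "__prsmrk__"),
]


def last_completed_stage(mk_columns):
    """Return the latest stage whose marker is present, or None if there is none."""
    completed = None
    for stage, qualifier in STAGE_MARKERS:
        if qualifier in mk_columns:
            completed = stage
    return completed


# The "mk" family of the seed row as it looks after the fetch step.
mk_after_fetch = {
    "_injmrk_": "y",
    "_gnmrk_": "1293732622-2092819984",
    "_ftcmrk_": "1293732622-2092819984",
}
print(last_completed_stage(mk_after_fetch))  # fetch
```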

Finally we parse the fetched content. This extracts the links and parses the text content out of the HTML.
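To make “extracts the links” concrete, here is a small stand-alone sketch that pairs each link URL on a page with its anchor text, which is the shape the outlinks take in the listing further down. It uses Python's standard html.parser and only looks at a elements, so it is a simplification and not Nutch's parser. The sample HTML is made up, and OutlinkCollector and extract_outlinks are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkCollector(HTMLParser):
    """Collect {absolute URL: anchor text} for every <a href=...> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = {}
        self._href = None   # URL of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.outlinks[self._href] = "".join(self._text).strip()
            self._href = None


def extract_outlinks(base_url, html):
    collector = OutlinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.outlinks


# Made-up sample markup; the URLs and anchors echo two of the outlinks in this post.
sample = '<p><a href="/">Home</a> <a href="2005_03_01_archive.html">March</a></p>'
for url, anchor in extract_outlinks("http://sujitpal.blogspot.com/", sample).items():
    print(url, "->", anchor)
```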

    sujit@cyclone:local$ bin/nutch parse 1293732622-2092819984
    ParserJob: starting
    ParserJob: resuming: false
    ParserJob: forced reparse: false
    ParserJob: batchId: 1293732622-2092819984
    ParserJob: success

This results in more columns written out to the webpage table. At this point it parses out the links from the page and stores them in the “ol” (outlinks) column family, and writes the parsed content for the page to the “p” column family.

    webpage : {
      key : "com.blogspot.sujitpal:http/",
      f : {
        bas : { timestamp : 1293732801833, value : "http://sujitpal.blogspot.com/" },
        cnt : { timestamp : 1293732801833, value : "DOCTYPE html PUBLIC "-//W3C//DTD X...rest of page content" },
        fi : { timestamp : 1293676557658, value : "\x00'\x8D\x00" },
        prot : { timestamp : 1293732801833, value : "x02\x00\x00" },
        st : { timestamp : 1293732801833, value : "x00\x00\x00\x02" },
        ts : { timestamp : 1293676557658, value : "\x00\x00\x01-5!\x9D\xE5" },
        typ : { timestamp : 1293732801833, value : "application/xhtml+xml" }
      },
      h : {
        Cache-Control : { timestamp : 1293732801833, value : "private" },
        Content-Type : { timestamp : 1293732801833, value : "text/html; charset=UTF-8" },
        Date : { timestamp : 1293732801833, value : "Thu, 30 Dec 2010 18:13:21 GMT" },
        ETag : { timestamp : 1293732801833, value : "40bdf8b9-8c0a-477e-9ee4-b19995601dde" },
        Expires : { timestamp : 1293732801833, value : "Thu, 30 Dec 2010 18:13:21 GMT" },
        Last-Modified : { timestamp : 1293732801833, value : "Thu, 30 Dec 2010 15:01:20 GMT" },
        Server : { timestamp : 1293732801833, value : "GSE" },
        Set-Cookie : { timestamp : 1293732801833, value : "blogger_TID=130c0c57a66d0704;HttpOnly" },
        X-Content-Type-Options : { timestamp : 1293732801833, value : "nosniff" },
        X-XSS-Protection : { timestamp : 1293732801833, value : "1; mode=block" }
      },
      mk : {
        _injmrk_ : { timestamp : 1293676557658, value : "y" },
        _gnmrk_ : { timestamp : 1293732629430, value : "1293732622-2092819984" },
        _ftcmrk_ : { timestamp : 1293732801833, value : "1293732622-2092819984" },
        __prsmrk__ : { timestamp : 1293732957501, value : "1293732622-2092819984" }
      },
      mtdt : {
        _csh_ : { timestamp : 1293676557658, value : "x80\x00\x00" }
      },
      s : {
        s : { timestamp : 1293676557658, value : "x80\x00\x00" }
      },
      ol : {
        http://pagead2.googlesyndication.com/pagead/show_ads.js : { timestamp : 1293732957501, value : "" },
        http://sujitpal.blogspot.com/ : { timestamp : 1293732957501, value : "Home" },
        http://sujitpal.blogspot.com/2005_03_01_archive.html : { timestamp : 1293732957501, value : "March" },
        // ... (more outlinks below) ...
      },
      p : {
        c : { timestamp : 1293732957501, value : "Salmon Run skip to main ... (rest of parsed content)" },
        sig : { timestamp : 1293732957501, value : "cW\xA5\xB7\xDD\xD3\xBF\x80oYR8\x1F\x80\x16" },
        st : { timestamp : 1293732957501, value : "\x02\x00\x00" },
        t : { timestamp : 1293732957501, value : "Salmon Run" },
        s : { timestamp : 1293732629430, value : "?\x80\x00\x00" }
      }
    }

We then run the updatedb command to add the outlinks discovered during the parse to the list of URLs to be fetched.

    sujit@cyclone:local$ bin/nutch updatedb
    DbUpdaterJob: starting
    DbUpdaterJob: done

This results in 152 rows in the HBase table. Each of the additional rows corresponds to an outlink discovered during the parse stage above.
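The seed row's key, com.blogspot.sujitpal:http/, looks like the URL with its host name reversed and the scheme moved after it, so presumably the new outlink rows are keyed the same way. The function below is a guess at that scheme, reconstructed from the single key visible in the listings; it is not Nutch's own code, and the port and query-string handling in particular are assumptions. The name reverse_url is hypothetical.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key of the form reversed.host:scheme[:port]/path[?query]."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"      # assumed placement of an explicit port
    key += parts.path or "/"
    if parts.query:
        key += "?" + parts.query     # assumed handling of query strings
    return key


print(reverse_url("http://sujitpal.blogspot.com/"))
# com.blogspot.sujitpal:http/
print(reverse_url("http://sujitpal.blogspot.com/2005_03_01_archive.html"))
# com.blogspot.sujitpal:http/2005_03_01_archive.html
```

One practical effect of keys shaped like this is that rows for pages on the same host sort next to each other, since HBase keeps rows ordered by key.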

    hbase(main):010:0> scan "webpage"
    ...
    152 row(s) in 1.0400 seconds
    hbase(main):011:0>

We can then go back to running generate, fetch, parse and updatedb until we have crawled to the desired depth.
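That repeat-until-done cycle could be scripted roughly as follows. This is a sketch under a few assumptions: that bin/nutch is run from the Nutch runtime directory as in the sessions above, that the “generated batch id” line from generate is written to the process's stdout or stderr (as it appeared on my terminal), and that a fixed number of rounds stands in for “desired depth”. The function names are hypothetical.

```python
import re
import subprocess

# Matches the last line printed by the generate step shown earlier.
BATCH_ID = re.compile(r"generated batch id:\s*(\S+)")


def parse_batch_id(generate_output):
    """Pull the batch id out of the text printed by `bin/nutch generate`."""
    match = BATCH_ID.search(generate_output)
    return match.group(1) if match else None


def run_nutch(*args, nutch="bin/nutch"):
    """Run one nutch subcommand and return everything it printed."""
    result = subprocess.run(
        [nutch, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # capture log lines from either stream
        universal_newlines=True,
        check=True,
    )
    return result.stdout


def crawl(depth):
    """Repeat generate -> fetch -> parse -> updatedb `depth` times."""
    for _ in range(depth):
        batch_id = parse_batch_id(run_nutch("generate"))
        if batch_id is None:
            break  # no batch id found, so there is nothing to fetch
        run_nutch("fetch", batch_id)
        run_nutch("parse", batch_id)
        run_nutch("updatedb")
```

Calling crawl(3) would then attempt three rounds, stopping early if generate reports no batch id.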

That's all for today. Happy New Year, and I hope you all had fun during the holidays. As I mentioned above, this exercise was for me to understand what Nutch does to the HBase datastore when each command is invoked. In the coming weeks, I plan on using this information to write some plugins that would drop “user” data into the database, and use it in later steps.