Tuesday, September 3, 2013

Crawling the Web with Nutch on Amazon Elastic MapReduce (EMR)

Nutch and EMR
These days, the obvious choice for crawling the web seems to be Nutch. It is a mature Apache project which spun off Hadoop and Lucene, two arguably even more successful Apache projects.
Running Nutch in the cloud is another sound choice: it runs as a series of Hadoop MapReduce jobs, and therefore benefits from the cloud's elasticity. AWS EMR is Amazon's managed Hadoop service, and its pay-per-use model seems perfect for start-ups or corporations experimenting with gathering knowledge from the web. I, however, could not find a write-up on how to get Nutch running on EMR anywhere I looked, hence this post and this GitHub project.
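As a sketch of the general shape (the bucket names, jar path, instance types, and AMI version below are placeholders I made up, not the exact setup from the GitHub project), each Nutch phase is an ordinary Hadoop job packaged inside the Nutch job jar, so it can be submitted to EMR as a custom JAR step:

```shell
# Sketch only: start a small EMR cluster and run Nutch's inject phase
# as a custom JAR step. The s3:// paths and instance sizes are
# placeholders; the .job jar comes from building Nutch 1.x with "ant job".
aws emr create-cluster \
    --name "nutch-crawl" \
    --ami-version 2.4 \
    --instance-type m1.large \
    --instance-count 3 \
    --log-uri s3://my-bucket/logs \
    --steps Type=CUSTOM_JAR,Name=inject,Jar=s3://my-bucket/apache-nutch-1.7.job,Args=[org.apache.nutch.crawl.Injector,crawl/crawldb,urls]
```

The later phases (generate, fetch, parse, updatedb) would each be added as further steps in the same fashion, since EMR executes steps in order on the same cluster.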

Monday, January 7, 2013

Visualizing Geohash

I recently had to process data about places, or points of interest, around the globe. It seemed intuitive to organize these records by their location. The standard way to group records in Hadoop is to have the records in the same group share a key prefix, so I needed a way to convert a latitude and longitude into a string of characters, and that is when I found Geohash. It is a well-known dimensionality reduction technique that transforms a two-dimensional spatial point (latitude, longitude) into an alphanumeric string, or hash.
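To make the transformation concrete, here is a minimal Geohash encoder in Python (a sketch of the standard algorithm, not code from my processing job): it alternately bisects the longitude and latitude ranges, records one bit per bisection, and packs the interleaved bits into base-32 characters. Note how truncating the hash yields exactly the key prefixes mentioned above: nearby points share a prefix.

```python
def geohash_encode(lat, lon, precision=11):
    """Encode a (lat, lon) point into a Geohash string of `precision` characters."""
    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    even = True  # even-indexed bits refine longitude, odd-indexed refine latitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # point is in the upper half of the range
        else:
            bits.append(0)
            rng[1] = mid  # point is in the lower half of the range
        even = not even
    # pack each run of 5 bits into one base-32 character
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = n * 2 + b
        chars.append(BASE32[n])
    return "".join(chars)

# Nearby points share a prefix, so sorting records by hash groups them together:
print(geohash_encode(57.64911, 10.40744))     # 'u4pruydqqvj'
print(geohash_encode(57.64911, 10.40744, 5))  # 'u4pru'
```

Each extra character narrows the cell by a factor of 32, so the precision argument directly controls how coarse or fine the grouping is.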
I'll describe the details of the points-of-interest processing in a future post. In this post, I will describe Geohash visually, because I believe that is easier for some people (like myself) to understand, and it would have saved me some time had someone else done it.