Monday, March 16, 2015

Using SparkNotebook for Kaggle Data Analysis - Part I

Some Kaggle challenges have data that does not fit in the memory of a personal computer, or that fits but makes any exploratory or predictive analysis painfully slow.
Spark Notebook is an open source project that simplifies big data analytics. It makes it easy to start an IPython-style notebook connected to a Spark cluster, run SQL against the data, and run machine learning algorithms.
This post is an introduction to using Spark Notebook for Kaggle challenges. The tool, however, is generic enough to work on other data analytics problems.

Tuesday, September 3, 2013

Crawling the Web with Nutch on Amazon Elastic MapReduce (EMR).

Nutch and EMR
The obvious choice for crawling the web these days seems to be Nutch. It is a mature Apache project that spun off Hadoop and Lucene, two arguably more successful Apache projects.
Running Nutch in the cloud is another sound choice: since it runs as a Hadoop MapReduce job, it benefits from the cloud's elasticity. AWS EMR is Amazon's managed Hadoop service, and its pay-per-use model seems perfect for start-ups or corporations experimenting with gathering knowledge from the web. I could not, however, find a write-up on how to get Nutch running on EMR anywhere I looked, hence this post and this GitHub project.

Monday, January 7, 2013

Visualizing Geohash

I recently had to process data about places, or points of interest, around the globe. It was intuitive to me to try to organize these records by their location. The standard way to group Hadoop records is to make the records in the same group share a key prefix. I needed to somehow convert a (latitude, longitude) pair into a string of characters, and that is when I found Geohash. It is a well-known dimensionality reduction technique that transforms the two-dimensional spatial point (latitude, longitude) into an alphanumeric string, or hash.
I'll describe the details of the points-of-interest processing in a future post. In this post, I will describe Geohash visually, because I believe it is easier for some people (like myself) to understand, and it would have saved me some time had anyone else done it.
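To make the grouping idea concrete, here is a minimal sketch of a Geohash encoder in Python (this is my own illustration, not code from the post; the function name is hypothetical). The encoding alternates between bisecting the longitude and latitude intervals, emitting one bit per bisection, and packs every 5 bits into one character of the Geohash base32 alphabet. Because nearby points fall into the same bisection cells for many steps, they share a hash prefix, which is exactly the property that lets Hadoop group them by key prefix:

```python
# Geohash base32 alphabet (digits plus lowercase letters, skipping a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=12):
    """Encode a (lat, lon) pair as a Geohash string of `precision` characters.

    Bits alternate between longitude and latitude (longitude first).
    Each bit records which half of the current interval contains the
    point, then the interval shrinks to that half; every 5 bits are
    packed into one base32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid       # point is in the upper half
            else:
                bits.append(0)
                lon_hi = mid       # point is in the lower half
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    # Pack groups of 5 bits into base32 characters.
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)
```

A shorter hash names a larger cell that contains all the longer hashes sharing its prefix, so `geohash_encode(lat, lon, 5)` is always a prefix of `geohash_encode(lat, lon, 12)` for the same point. Sorting records by their Geohash therefore places spatially close points next to each other, which is what makes it usable as a Hadoop grouping key.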