TF-IDF revisited
Remember the buffering problems for the TF-IDF program discussed in a previous post as well as the lecture about MapReduce algorithms from Cloudera’s free Hadoop training? Thanks to the new joining...
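For readers who have not seen the earlier post, here is a minimal in-memory sketch of what a TF-IDF computation produces. It is plain Python rather than a Dumbo or MapReduce program, and it does not reproduce the post's code; the function name `tfidf`, the whitespace tokenization, and the unsmoothed `log(N / df)` weighting are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF scores for a small in-memory corpus.

    docs: dict mapping a document id to its text.
    Returns a dict mapping (doc_id, word) to a score.
    Hypothetical helper for illustration only.
    """
    # Per-document word counts (naive whitespace tokenization).
    counts = {doc: Counter(text.split()) for doc, text in docs.items()}

    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for word_counts in counts.values():
        df.update(word_counts.keys())

    n_docs = len(docs)
    scores = {}
    for doc, word_counts in counts.items():
        total = sum(word_counts.values())
        for word, n in word_counts.items():
            tf = n / total                     # relative term frequency
            idf = math.log(n_docs / df[word])  # unsmoothed inverse document frequency
            scores[(doc, word)] = tf * idf
    return scores


if __name__ == "__main__":
    corpus = {"a": "hadoop python", "b": "hadoop java"}
    for key, score in sorted(tfidf(corpus).items()):
        print(key, round(score, 4))
```

Everything here fits in memory at once, which is exactly what a job over a large corpus cannot assume, so the word counts and document frequencies have to be brought together by some form of join instead.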
Virtual Python environments
Judging from some of the questions about Dumbo development that keep popping up, virtual Python environments are apparently not that widely known and used yet. Therefore, I thought it made sense to...
Dumbo on Cloudera’s distribution
Over the last couple of days, I picked up some rumors concerning the inclusion of all patches on which Dumbo relies in the most recent version of Cloudera’s Hadoop distribution. Todd confirmed this to...
Multiple outputs
Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example:

    from dumbo import run, sumreducer, opt

    def mapper(key, value):
        for word in value.split():
            yield word,...
Integration with Java code
Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while. Flexibility and speed are probably the most likely potential reasons. Thanks...
Dumbo over HBase
This should be old news for dumbo-user subscribers, but Tim has, once again, put his Java coding skills to good use. This time around he created nifty input and output formats for consuming and/or...
Moving to Hadoop 0.20
We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process: We’re probably going to...
Dumbo on Amazon EMR
A while ago, I received an email from Andrew in which he wrote: Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so: $ elastic-mapreduce...
Reading Hadoop records in Python
At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows: Hadoop records → CsvRecordInput...
Consuming Dumbo output with Pig
Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time....