Dumbotics » Tips and tricks

TF-IDF revisited

May 17, 2009, 1:54 am

Remember the buffering problems for the TF-IDF program discussed in a previous post as well as the lecture about MapReduce algorithms from Cloudera’s free Hadoop training? Thanks to the new joining...

Virtual Python environments

May 24, 2009, 9:30 am

Judging from some of the questions about Dumbo development that keep popping up, virtual Python environments are apparently not that widely known and used yet. Therefore, I thought it made sense to...

Dumbo on Cloudera’s distribution

May 31, 2009, 8:49 am

Over the last couple of days, I picked up some rumors concerning the inclusion of all patches on which Dumbo relies in the most recent version of Cloudera’s Hadoop distribution. Todd confirmed this to...

Multiple outputs

June 8, 2009, 5:11 am

Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example:

    from dumbo import run, sumreducer, opt

    def mapper(key, value):
        for word in value.split():
            yield word,...

Integration with Java code

June 16, 2009, 3:15 pm

Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while. Flexibility and speed are probably the most likely reasons. Thanks...

Dumbo over HBase

July 31, 2009, 6:46 am

This should be old news for dumbo-user subscribers, but Tim has, once again, put his Java coding skills to good use. This time around he created nifty input and output formats for consuming and/or...

Moving to Hadoop 0.20

November 23, 2009, 1:26 am

We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process: We’re probably going to...

Dumbo on Amazon EMR

December 23, 2009, 1:24 am

A while ago, I received an email from Andrew in which he wrote: Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so: $ elastic-mapreduce...

Reading Hadoop records in Python

December 23, 2009, 12:32 pm

At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows: Hadoop records → CsvRecordInput...

Consuming Dumbo output with Pig

February 5, 2010, 2:39 am

Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time....
