Channel: Dumbotics » Tips and tricks
Browsing all 10 articles

TF-IDF revisited

Remember the buffering problems for the TF-IDF program discussed in a previous post as well as the lecture about MapReduce algorithms from Cloudera’s free Hadoop training? Thanks to the new joining...

Virtual Python environments

Judging from some of the questions about Dumbo development that keep popping up, virtual Python environments are apparently not that widely known and used yet. Therefore, I thought it made sense to...

Dumbo on Cloudera’s distribution

Over the last couple of days, I picked up some rumors concerning the inclusion of all patches on which Dumbo relies in the most recent version of Cloudera’s Hadoop distribution. Todd confirmed this to...

Multiple outputs

Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example: from dumbo import run, sumreducer, opt def mapper(key, value): for word in value.split(): yield word,...

Integration with Java code

Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while. Flexibility and speed are probably the most likely potential reasons. Thanks...

Dumbo over HBase

This should be old news for dumbo-user subscribers, but Tim has, once again, put his Java coding skills to good use. This time around he created nifty input and output formats for consuming and/or...

Moving to Hadoop 0.20

We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process: We’re probably going to...

Dumbo on Amazon EMR

A while ago, I received an email from Andrew in which he wrote: Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so: $ elastic-mapreduce...

Reading Hadoop records in Python

At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows: Hadoop records → CsvRecordInput...

Consuming Dumbo output with Pig

Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time....
