The Washington Post

How NASA is using Hadoop to advance climate science

Whenever climate scientists want to analyze climate data, they first have to request it in its messy original format and clean it up before the analysis can even begin. That sort of work eats up valuable time, so it makes sense that the federal government has started funding efforts to simplify the process.

Speaking at the 2013 Hadoop Summit in San Jose on Wednesday, NASA software developer Glenn Tamkin explained how he and one of his colleagues have been cooking up a 34-node Hadoop cluster for NASA’s Center for Climate Simulation that can analyze slices of the data in response to end users’ queries. The new architecture could be handy for seeing how the data stacks up against other data sets used in the U.S. and abroad.

Tamkin’s team has an 80 TB data set on its hands covering all kinds of information about the climate and atmosphere: winds, clouds, humidity, air and water temperature and so on for the past three decades. The data includes observational information mostly collected from satellites, as well as simulation data for filling in gaps. But it’s not continually streaming in; rather, it gets fed in about once a year, Tamkin said. The data is already publicly available.

The developers have brought this data into the Hadoop Distributed File System and rely on all the scaled-out nodes to quickly compute sums, counts, averages, standard deviation and other measurements in MapReduce.
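The article doesn’t show NASA’s actual code, but the pattern it describes — map over many records in parallel, then reduce partial results into counts, sums, means and standard deviations — can be sketched in plain Python. All names here (`map_phase`, `reduce_phase`, the `"temp"` key) are hypothetical, purely for illustration:

```python
from collections import defaultdict

def map_phase(records):
    """For each (variable, value) record, emit a partial aggregate:
    (count, sum, sum of squares). These partials can be computed
    independently on each node, which is what makes the job scale."""
    for key, value in records:
        yield key, (1, value, value * value)

def reduce_phase(partials):
    """Merge partials per key, then derive count, sum, mean and
    standard deviation from the combined totals."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for key, (n, s, sq) in partials:
        t = totals[key]
        t[0] += n
        t[1] += s
        t[2] += sq
    results = {}
    for key, (n, s, sq) in totals.items():
        mean = s / n
        variance = max(sq / n - mean * mean, 0.0)
        results[key] = {"count": n, "sum": s, "mean": mean,
                        "stddev": variance ** 0.5}
    return results

# Toy stand-in for three decades of satellite readings.
records = [("temp", 10.0), ("temp", 20.0), ("temp", 30.0)]
stats = reduce_phase(map_phase(records))
```

The key design point is that the mapper emits only additive quantities (count, sum, sum of squares), so partial results from any number of nodes can be combined in any order and the mean and standard deviation fall out at the end.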

While the MapReduce jobs don’t run as fast as he would like — it took two minutes to answer one query Tamkin ran recently — the new Hadoop setup sounds like it would be a lot less trouble for scientists looking for basic information across many years.

NASA is now employing the Cloudera Distribution for Hadoop for this work, although Tamkin said he’s not using every part of it; he would like to tack on more cluster-management components to further optimize the system. He also wants to develop a method for caching queries so they can run faster.
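The article doesn’t say how that query cache would work; one minimal sketch, assuming queries are identified by their parameters, is to memoize results so a repeated query skips the two-minute MapReduce job entirely. The `run_query` function below is hypothetical, not NASA’s API:

```python
import functools

@functools.lru_cache(maxsize=256)
def run_query(variable, start_year, end_year, statistic):
    """Stand-in for submitting a MapReduce job to the cluster.
    The first call with a given set of parameters does the (slow)
    work; identical later calls return the cached result instantly."""
    # A real implementation would launch the Hadoop job here.
    return f"{statistic}({variable}, {start_year}-{end_year})"

first = run_query("temp", 1980, 2010, "mean")   # cache miss: runs the job
second = run_query("temp", 1980, 2010, "mean")  # cache hit: instant
```

The trade-off is staleness: since the data set is only refreshed about once a year, cached results stay valid for a long time, and the cache would simply be invalidated when a new batch of data lands.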

The project will end up serving data out of Hadoop through an API to scientists across government agencies and private organizations later this year, Tamkin said. And like the data itself, the API will also become available to the general public, perhaps as soon as February 2014, Tamkin said.

(c) 2013, The Washington Post


