The curious collection of a slightly mad scientist
Here’s me starting some basic snooping on the technical side of what the NSA does when it attempts to monitor everyone in the world. How can it do that? Read on if you want to geek out.
… On June 9, the Wall Street Journal reported that for the last few years the National Security Agency has been relying on a software program “with the quirky name Hadoop” to help it make sense of its enormous collections of data. Named after a toy elephant that belonged to the child of one of the original developers of the program, “Hadoop,” reported the Journal, is a crucial part of “a computing and software revolution … a piece of free software that lets users distribute big-data projects across hundreds or thousands of computers.”
“Revolution” is probably the most overused word in the chronicle of Internet history, but if anything, the Wall Street Journal undersold the real story. Hadoop’s importance to how we live our lives today is hard to overstate. By making it economically feasible to extract meaning from the massive streams of data that increasingly define our online existence, Hadoop effectively enabled the surveillance state.
And not just in the narrowest, Big Brother, government-is-watching-everyone-all-the-time sense of that term. Hadoop is equally critical toprivate sector corporate surveillance. Facebook, Twitter, Yahoo, Amazon, Netflix — just about every big player that gathers the trillions of data “events” generated by our everyday online actions employs Hadoop as a part of their arsenal of Big Data-crunching tools. Hadoop is everywhere — as one programmer told me, “it’s taken over the world.”
The Journal’s description of Hadoop as “a piece of free software” barely scratches the surface of the significance of this particular batch of code. In the past half-decade Hadoop has emerged as one of the triumphs of the non-proprietary, open-source software programming methodology that previously gave us the Apache Web server, the Linux operating system and the Firefox browser. Hadoop belongs to nobody. Anyone can copy it, modify, extend it as they please. Funny, that: A software program developed collaboratively by programmers who believe that their code should be shared in as open and transparent a process as possible has resulted in the creation of tools that everyone from the NSA to Facebook uses to annihilate any semblance of individual privacy. But what’s even more ironic, and fascinating, is the sight of intelligence agencies like the NSA and CIA joining in and becoming integral players in the world of open source big data software. The NSA doesn’t just use Hadoop. NSA programmers have improved and extended Hadoop and donated their changes and additions back to the larger community. The CIA actively invests in start-ups that are commercializing Hadoop and other open source projects.
They’re all in it together. The spooks and the social media titans and the online commerce goliaths are collaborating to improve data-crunching software tools that enable the tracking of our behavior in fantastically intimate ways that simply weren’t possible as recently as four or five years ago. It’s a new military industrial open source Big Data complex. The gift economy has delivered us the surveillance state.
Hadoop’s earliest roots go back to 2002, when Doug Cutting, then the search director at the Internet Archive, and Michael Cafarella, a graduate student at the University of Washington, started working on an open-source search engine called “Nutch.” But the project did not get serious traction until Cutting joined Yahoo and began to merge his work into Yahoo’s larger strategic goal of improving its search engine technology so as to better compete with Google. Significantly, Yahoo executives decided not to make the project proprietary. In 2006, they blessed the formation of Hadoop, an open-source project managed under the auspices of the Apache Software Foundation. (For a much more detailed look at the history of Hadoop, please read this four-part history of Hadoop at GigaOm.)
Hadoop is basically a nifty hack. The definition, per Wikipedia, is surprisingly simple: “It supports the running of applications on large clusters of commodity hardware.” Bottom line, Hadoop provides a means for distributing both the storage and processing of an enormous amount of data over lots and lots of relatively inexpensive computers. Hadoop turned out to be cheap, fast and scalable — meaning it could expand smoothly in capacity as the flows of data it was crunching burgeoned in size, simply though plugging in extra computers to the network. Hadoop was also fundamentally modular — different parts of it could be easily replaced by custom designed chunks of software, making it seamlessly adaptable to the individual circumstances of different corporations — or government agencies.
Hadoop’s debut was timely, addressing not only the problems Yahoo faced in managing the enormous amounts of data produced by its users, but also those that the entire Internet industry was simultaneously struggling to cope with. Basically, the Internet had become a victim of its own success. The enormous flows of data generated by users of the likes of Facebook and Twitter far overwhelmed the ability of those companies to make sense of it. There was too much coming in too fast. Hadoop helped companies cope with the tsunami — it was, in the words of Jeff Hammerbacher, an early employee of Facebook, “our tool for exploiting the unreasonable effectiveness of data.”
Before Hadoop, you were at the mercy of your data. After Hadoop, you were in charge. You could figure out all kinds of interesting things. You could recognize patterns in the data and start to make inferences about what might happen if you made tweaks to your product. What did users do when the interface was adjusted like this? What kinds of ads made them more likely to pull out their credit cards? What did that batch of millions of Verizon calls reveal about the formation of a potential terrorist cell? Facebook wouldn’t be able to exploit the insights of its so-called social graphwithout tools like Hadoop.
“Hadoop has become the de facto standard tool for cost-effectively processing Big Data,” says Raymie Stata, who served as chief technology officer at Yahoo before eventually starting his own Hadoop-focused start-up, Altiscale. And the significance of being able to cheaply process Big Data, to accurately “measure” what your users are doing, he added, is a “big deal.” …
When Facebook, Twitter, Yahoo! and others bet big on Hadoop, they also knew that HDFS and MapReduce were limited in their ability to deal with expressive queries through a language like SQL. This is how Hive, Pig, and Sqoop were ultimately hatched. Given that so much data on earth is managed through SQL, many companies and projects are offering ways to address the compatibility of Hadoop and SQL. Pivotal HD’s HAWQ is one example—a parallel SQL-compliant query engine that has shown to be 10 to 100s of times faster than other Hadoop query engines in the market today—and it was built to support petabyte data sets.
… Who Uses Hadoop?
A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page. …
… Where did Hadoop come from?
Mike Olson: The underlyingtechnology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.
What problems can Hadoop solve?
Mike Olson: The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples. …
It’s fair to say that a current Hadoop adopter must be more sophisticated than a relational database adopter. There are not that many “shrink wrapped” applications today that you can get right out of the box and run on your Hadoop processor. It’s similar to the early ’80s when Ingres and IBM were selling their database engines and people often had to write applications locally to operate on the data. That said, you can develop applications in a lot of different languages that run on the Hadoop framework. The developer tools and interfaces are pretty simple. Some of our partners — Informatica is a good example — have ported their tools so that they’re able to talk to data stored in a Hadoop cluster using Hadoop APIs. There are specialist vendors that are up and coming, and there are also a couple of general process query tools: a version of SQL that lets you interact with data stored on a Hadoop cluster, and Pig, a language …
Hadoop is an open-source software framework for running large batch-oriented information analytics jobs across clusters of commodity servers. Inspired by work at Google and Yahoo!, Hadoop exposes the MapReduce programming paradigm and can also be used with open-source tools from Apache including Hbase, Pig and Hive. Unstructured data stored in Hadoop is distributed and stored multiple times across the Hadoop machine cluster to optimize both performance and reliability.… Every day, people send 150 billion new email messages. The number of mobile devices already exceeds the world’s population and is growing. With every keystroke and click, we are creating new data at a blistering pace.
This brave new world is a potential treasure trove for data scientists and analysts who can comb through massive amounts of data for new insights, research breakthroughs, undetected fraud or other yet-to-be-discovered purposes. But it also presents a problem for traditional relational databases and analytics tools, which were not built to handle the data being created. Another challenge is the mixed sources and formats, which include XML, log files, objects, text, binary and more.
“We have a lot of data in structured databases, traditional relational databases now, but we have data coming in from so many sources that trying to categorize that, classify it and get it entered into a traditional database is beyond the scope of our capabilities,” said Jack Collins, director of the Advanced Biomedical Computing Center at the Frederick National Laboratory for Cancer Research. “Computer technology is growing rapidly, but the number of [full-time equivalent positions] that we have to work with this is not growing. We have to find a different way.”
Enter Apache Hadoop, an open-source, distributed programming framework that relies on parallel processing to store and analyze tremendous amounts of structured and unstructured data. Although Hadoop is far from the only big-data tool, it is one that has generated remarkable buzz and excitement in recent years. And it offers a possible solution for IT leaders who are realizing that they will soon be buried in more data than they can efficiently manage and use.
“In the last 10 years, this is one of the most important developments because it’s really transforming the way we work, our business processes and the way we think about data,” said Ed Granstedt, a vice president at predictive analytics firm GoldBot Consulting. “This change is coming, and if government leaders don’t understand how to use this change, they’re going to get left behind or pushed aside.”
Why it matters
Hadoop is more than just a faster, cheaper database and analytics tool. In some cases, the Hadoop framework lets users query datasets in previously unimaginable ways.