The EMBERS Project Can Predict the Future With Twitter

Leah McGrath Goodman
Newsweek
March 7, 2015

For the majority of Americans born after World War II, it is unlikely Arlington, Virginia, holds any special significance. But for those who know that the outcome of the war largely hinged on Imitation Game–style code-breaking, Arlington has a mystique as the epicenter of American military cryptanalysis.

In 1942, the U.S. Army Signal Intelligence Service quietly took up residence at the Arlington Hall Junior College for Girls—a private school that instructed young ladies on art, music, manners, proper dress and home economics—and used it as its headquarters for staging attacks on Japanese cipher systems. The National Security Agency, founded in 1952, was originally based at Arlington Hall. The Defense Intelligence Agency, formed by Secretary of Defense Robert McNamara at the Pentagon a decade later, also occupied two buildings there.

Today, Arlington maintains its code-breaking roots, but now it’s cracking other types of codes—and has advanced into the realm of quantum computing, becoming a hotbed of government-funded research initiatives, spearheaded by both public and private institutions primarily serving Washington.

One of these, Virginia Tech (VT), offers a glimpse into just how much “big data” has changed the game by magnifying the U.S. intelligence community’s ability to forecast—with phenomenal accuracy—human behavior on a global scale by scouring Twitter, YouTube, Wikipedia, Tumblr, Tor, Facebook and more. VT is using algorithms and a variety of advanced tools to sort through dense and complex information for patterns in the chaos—patterns that frequently point to events before they happen, such as civil uprisings, disease outbreaks, humanitarian crises, mass migrations, protests, riots, political routs, even violence.

“Anytime you tweet or post on Facebook, you are becoming a part of the big data economy,” says Naren Ramakrishnan, a professor of computer science at VT and director of the university’s Discovery Analytics Center, which, as he describes it, “studies the entire gamut of data science.” Last year, the center moved its base of operations from VT’s Blacksburg, Virginia, campus to Arlington—also home to the Pentagon—after scoring more than $15 million of grants and contracts for its EMBERS project. Ramakrishnan runs the project, which is, so far, leading the arms race to turn big data into forecasts that U.S. policymakers and intelligence agencies can use.

“A lot of analysts can give you forecasts for the coming year, but when we do forecasts, we’re talking about specific dates,” says Ramakrishnan.

Since the project’s inception in April 2012, an average of 80 to 90 percent of the forecasts it generates have turned out to be accurate, and they arrive an average of seven days in advance of the predicted event. EMBERS (short for Early Model Based Event Recognition using Surrogates) derives its intelligence from what data geeks call “open-source indicators”: social media, satellite imagery and more than 200,000 blogs that are publicly available. It mines up to 2,000 messages a second and purchases open-source data such as Twitter’s “firehose,” which streams hundreds of millions of real-time tweets a day.

While much has been made of the government’s secret surveillance operations—particularly those that spy on Americans—the EMBERS project is focused on tracking human behavior overseas and publishing its findings, even if negative. “We are not looking at anything classified and we aren’t forecasting terrorism, because we don’t have access to those kinds of back channels,” Ramakrishnan says. “We are looking at data anyone can get.”

It’s a fully automated system that churns out 45 to 50 alerts a day, running 24 hours a day, seven days a week. It spits out the date of a predicted event, the location and coordinates, who or which groups are involved, the reason for the unrest and the confidence level of the prediction. The goal? To forecast anything that might give the U.S. a heads-up on protecting Americans overseas, as well as its allies.
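For readers who think in code, the fields listed above could be pictured as a simple record. This is only an illustrative sketch: the class name, field names and sample values are invented here, and the actual EMBERS alert format is not described in this article.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are assumptions, not the EMBERS schema."""
    event_date: date          # predicted date of the event
    location: str             # place name
    latitude: float           # coordinates of the location
    longitude: float
    actors: Tuple[str, ...]   # who or which groups are involved
    reason: str               # the reason for the unrest
    confidence: float         # confidence level, expressed here as 0.0 to 1.0

    def __post_init__(self):
        # Reject confidence values outside the assumed 0-to-1 scale.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Made-up values for illustration only, not a real forecast.
example = Alert(
    event_date=date(2015, 3, 14),
    location="Caracas, Venezuela",
    latitude=10.48,
    longitude=-66.90,
    actors=("students",),
    reason="political protest",
    confidence=0.8,
)
```

Expressing confidence as a number between 0 and 1 is a choice made for this sketch; the article does not say what scale the system uses.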

The project was first put to work examining open-source data streams in Latin America: It accurately predicted the impeachment of Paraguay’s president in 2012, the World Cup protests in Brazil in 2013, and the 2014 violent student protests in Venezuela. These days, the program monitors 20 countries in Latin America and is beginning to move into the Middle East and North Africa, covering Iraq, Syria, Egypt, Bahrain, Jordan, Saudi Arabia and Libya.

EMBERS was the product of a 2012 contest organized by Jason Matheny, an associate director of the government’s Office for Anticipating Surprise (yes, that’s the name of a real office) and a program manager at the Intelligence Advanced Research Projects Activity program in the Office of the Director of National Intelligence. Three teams—from Virginia Tech, quantum computing firm Raytheon BBN Technologies in Cambridge, Massachusetts, and HRL in Malibu, California, formerly Hughes Research Laboratories—were asked to build the best possible forecasting model based on open-source indicators. The most successful of these was EMBERS, which ended up integrating several members of the other teams into its own, including Raytheon BBN, which now builds some of EMBERS’s social media models, like the ones trying to forecast civil unrest from reading Twitter feeds. Some of the guiding principles of the research, says Scott Miller, senior technical director of Raytheon BBN’s speech and language group, are astoundingly simple.

“We look for chatter, specific words indicative of protest,” says Miller. “We found there’s a correlation between the aggregate frequencies of unrest terms—for example, the Spanish word protesta—and the amount of civil unrest that we find happening in those regions.”
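The idea Miller describes, counting unrest terms and comparing the counts with observed unrest, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Raytheon BBN’s model: the term list, the function names and the input format (pairs of day and message text) are all hypothetical, and the Pearson coefficient is just one common way to measure correlation.

```python
from collections import Counter
from math import sqrt

# Illustrative Spanish unrest terms; only "protesta" comes from the article.
UNREST_TERMS = {"protesta", "huelga", "marcha"}


def daily_term_counts(messages, terms=UNREST_TERMS):
    """Count unrest-term occurrences per day from (day, text) pairs."""
    counts = Counter()
    for day, text in messages:
        words = (w.strip(".,!?¡¿#") for w in text.lower().split())
        counts[day] += sum(1 for w in words if w in terms)
    return counts


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In use, one would line up the daily term counts for a region with the number of unrest events recorded there on the same days and pass the two sequences to `pearson`. A value near 1 would suggest the chatter tracks the unrest, though correlation alone does not make a forecast.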

Photo: Kuwaiti citizen Raken Subaiya checks his Twitter feed on his phone as Yousef al Anazi looks on during a sit-in protest in front of the Justice Palace in Kuwait City on October 19, 2012. (Stephanie McGehee/Reuters)

In other cases, though, information coming in can be much more complex. Because the info can be in the form of a picture, words or a chart—not to mention spanning many different languages and dialects—EMBERS uses advanced data-extraction and translation methods in partnership with another Cambridge company, Basis Technology, which enriches data and provides text analytics tools that, rather than translate foreign languages into English, draw direct meaning from native tongues. For instance, it is able to interpret Arabic printed in English phonetic characters (popular on Twitter). Graphical data is read right off Tumblr and aerial satellite photos are processed through automated tools for imagery.

Despite the technological sophistication, the algorithms of the predictive models still have to go through a great deal of trial and error. A team of 80 experts and 13 subcontractors—including social scientists, computer scientists, epidemiologists, political scientists, statisticians and regional experts for each country—work on designing and updating the best possible models. Ramakrishnan likens training computers to recognize patterns to teaching email applications to recognize spam. There is a “supermodel” that, over time, “learns which models are best, but it keeps on learning, because the situations in these countries change over time,” says Ramakrishnan. The supermodel receives a monthly report card on the accuracy of its predictions, which tells it which models are working in which combinations—and which ones aren’t. Then they adjust accordingly.
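The article does not say how the supermodel weighs its component models. One standard approach to this kind of problem is a multiplicative-weights update, in which models that scored well on the latest report card gain influence and the rest lose it. The sketch below assumes that approach purely for illustration; the function names, the learning rate and the accuracy scale are hypothetical.

```python
from math import exp


def update_weights(weights, accuracies, learning_rate=1.0):
    """Reweight models after a report card.

    weights:    dict of model name -> current weight (summing to 1)
    accuracies: dict of model name -> accuracy on the last period, 0.0 to 1.0
    """
    scaled = {
        name: weight * exp(learning_rate * accuracies[name])
        for name, weight in weights.items()
    }
    total = sum(scaled.values())
    # Normalize so the new weights again sum to 1.
    return {name: value / total for name, value in scaled.items()}


def combine(weights, model_probabilities):
    """Blend each model's event probability using the current weights."""
    return sum(weights[name] * p for name, p in model_probabilities.items())
```

Because the update is repeated every period, a model that stops working as conditions in a country change would gradually lose weight, which matches the “keeps on learning” behavior Ramakrishnan describes.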

The independent contractor that reads and grades the accuracy of EMBERS’s forecasts is a nonprofit research facility in nearby McLean, Virginia, called MITRE, a collection of government-funded research centers. Terry Reed, the information systems engineer in MITRE’s Homeland Security Systems Engineering and Development Institute, oversees a team of about a dozen people who match EMBERS alerts to news reports to determine if its predictions come true. EMBERS now scores nearly perfect in predicting that events will happen, but is still working on getting the details of each event right, says Matheny.
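The grading step, matching alerts to news reports, can be pictured as follows. This is a simplified sketch and not MITRE’s actual procedure: the tuple layouts, the three-day date tolerance and the exact-match rule for locations are assumptions made for illustration.

```python
from datetime import date


def score_alerts(alerts, events, max_date_error_days=3):
    """Match alerts to reported events and summarize the results.

    alerts: list of (issued_on, predicted_date, location) tuples
    events: list of (event_date, location) tuples drawn from news reports
    Returns (fraction of alerts matched, mean lead time in days or None).
    """
    unmatched = list(events)
    lead_times = []
    for issued_on, predicted_date, location in alerts:
        for event in unmatched:
            event_date, event_location = event
            same_place = event_location == location
            close_in_time = abs((event_date - predicted_date).days) <= max_date_error_days
            issued_before = issued_on < event_date
            if same_place and close_in_time and issued_before:
                lead_times.append((event_date - issued_on).days)
                unmatched.remove(event)  # each event can confirm only one alert
                break
    matched_fraction = len(lead_times) / len(alerts) if alerts else 0.0
    mean_lead = sum(lead_times) / len(lead_times) if lead_times else None
    return matched_fraction, mean_lead
```

Under these assumptions, an alert counts as correct only if it was issued before the event and named the right place and roughly the right date. Checking further details, such as who was involved and why, would take a richer comparison than this sketch attempts.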

Ramakrishnan says he believes EMBERS has the potential to forecast population-level events all over the world. “One could imagine technologies like this would be useful in the future and could become mainstream,” he says. “Trying to predict this is not new. What is new is that social media is allowing us to do this better.”

To date, government agencies are not concertedly acting on the EMBERS project’s predictions, and it remains unclear what the government plans to do with these burgeoning abilities. Matheny declined to disclose exactly which government agencies are keen to adopt the predictive technology of EMBERS, but he confirmed to Newsweek that intelligence, public health, humanitarian affairs and global and national security agencies are closely tracking it. “We keep government partners informed about the results of the research,” he says. “Over a dozen agencies have been given regular updates on the progress of this research.” One of the agencies using EMBERS alerts, says Ramakrishnan, is the Centers for Disease Control and Prevention. In addition to providing information to government agencies, VT also can sell access to its social-media technologies to commercial entities, although there is no immediate plan for that yet, says Ramakrishnan.

“There are a lot of legitimate reasons to do this,” says Ramakrishnan. “This can allow us to increase our security at hot spots or offer more accurate travel advisories, protect Americans from violence and increase our security at embassies.”

MITRE, for its part, has deep connections to the nation’s defense, security and intelligence apparatus. In fact, according to MITRE, Reed represents the Department of Homeland Security’s information security chief on a committee within the National Security Systems Working Group focused on policy issues related to classified information systems. While MITRE confirmed Reed’s work with EMBERS, she declined to be interviewed by Newsweek.

EMBERS also may not be the only government project being honed to target social media for forecasting purposes. In February, when a group claiming to be associated with ISIS briefly took over Newsweek’s Twitter feed, it released what appeared to be an Army document detailing “The Gist Mill Pilot Project,” which referred to a “concept of operations” for open-source indicators and “social media analysis.” According to a Pentagon spokesman, the project was discontinued in 2013, but the Army is in the process of fusing social media with its traditional intelligence, surveillance and reconnaissance operations and continuously introducing new capabilities.

Despite the predictive benefits of tapping into open-source data, Karen Greenberg, the director of the Center on National Security at Fordham University in New York, cautions that closely tracking the masses through social media and other means sounds a lot less Imitation Game and a lot more Minority Report.

“We really need to decide on some guidelines and legal and ethical parameters at the initial stages of all of these projects,” she says. “We have seen that when that does not happen, we later hear from our government, ‘We are dependent on this program, we can’t dismantle it now.’ The consequences of these programs are extraordinary. As a nation, do we agree that we are so unsafe that we need these programs to reduce our risk to zero at the expense of our privacy?”

Intelligence officials often point out that the term mass surveillance is a misnomer, arguing that the goal of government surveillance is to target specific individuals or groups, not the masses. But EMBERS only engages in mass surveillance. “We are not tracking individuals in our project,” says Ramakrishnan. “We are following crowds and groups.” He notes that the program does follow the Twitter feeds of public figures and other key leaders, because they have outsized influence on the masses, but not those of ordinary citizens.

Greenberg adds that while such tools are no doubt useful, there are signs the government may be becoming overly dependent on technology to alert it to security threats. “Somehow, we missed the Arab Spring, we missed the rise of ISIS,” she says. “These are valuable technological tools, but there is no substitute for being on the ground. You want your answer from more than two clicks.”

Raytheon BBN’s Miller agrees that when in doubt, there’s nothing like the ground truth: “Right now, our Middle East predictions are not there yet. The best way to figure something out is to just ask somebody.”