The Many Uses of Big Data
For decades, technology enthusiasts have labored to impress upon us the perfectly enormous giganticness of the ocean of digital information in which we moderns swim: If the Internet as it existed in 1993 were Windsor Castle, how many Libraries of Congress would fit into just one of its broom closets? And so on.
The data ocean really has gone right on expanding at an exponential rate. And “exponential” is a scary term. IBM says that 90 percent of the data that exists in the world today has been created in just the past two years. We now generate 2.5 quintillion bytes of data every day. (A quintillion is 10 to the 18th power—as if that’s any help in grasping the magnitude.)
This data comes in a variety of forms, including text, audio, video, and still images. It comes from web transactions, traffic and climate sensors, satellites, cell phones, “smart” home appliances, factory equipment, and other machines that communicate with one another.
A staggering amount of data comes from social media alone. Last October, Facebook signed up its billionth user. About 250 million photos are uploaded to Facebook every day. Twitter users generate 12 terabytes (a terabyte is 1,000 gigabytes) of tweets per day.
It isn’t unusual for major companies to amass terabytes of data on their own. Some have far more. The Economist reported that Walmart’s databases contained 2.5 petabytes (2,500 terabytes or 2.5 million gigabytes) of data. That was three years ago.
The technology industry has settled, for the present, on a rather prosaic term for these mind-boggling amounts of digital stuff: big data. No matter what angle you approach it from, the question with big data is whether and how it can be made useful to people and organizations. Mining big data for patterns of useful information sometimes cannot be done within a reasonable time frame using traditional database software. The task often requires newer distributed-computing frameworks such as Apache Hadoop.
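To make the Hadoop idea concrete, here is a minimal sketch of the map-reduce pattern that framework popularized, counting words across a pile of short messages. Hadoop distributes the "map" and "reduce" phases across many machines; this toy version simply runs both in one process, and the sample data is invented.

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each distinct key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

tweets = ["big data is big", "data beats opinion"]  # invented sample records
counts = reduce_phase(map_phase(tweets))
# counts["big"] == 2, counts["data"] == 2
```

The appeal of the pattern is that the map and reduce steps are independent per record and per key, so a framework like Hadoop can spread them across a cluster without changing the logic.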
Despite the challenges, interesting findings from big data are beginning to emerge in all sorts of fields—retail, utilities, banking, health care, law enforcement, and more. The possibilities are impressive. A 2011 report by the McKinsey Global Institute suggests that retailers using big data “to the full” could boost their operating margins by 60 percent. The U.S. health care industry could use big data to “create more than $300 billion in value every year.”
Even politics is affected. In the wake of President Obama’s 2012 reelection, Time magazine and other sources pointed to big data as a major factor in his campaign’s success. Compiling mountains of previously unstructured data from social media sites and elsewhere, Obama’s number crunchers detected patterns that told them, with unprecedented accuracy, such things as which voting districts to target, which messages would appeal to women and minority voters, which potential donors to approach, how best to approach different types of donors, and how to spend their money most effectively.
Twin Cities organizations are in the thick of the hunt to find useful patterns in big data. Here are a few examples of what they’re doing.
Crunching large amounts of data is nothing new to Fair Isaac Corporation. Based in Arden Hills and better known as FICO, Fair Isaac is one of the companies behind the three-digit scores that determine consumers’ credit ratings.
There are about 315 million people in the United States, and roughly 70 percent of them “have some sort of credit,” says Andy Flint, FICO’s senior director of product management. So in its rating business alone, FICO “deals with hundreds of millions of credit records,” Flint says. “Our computers run continuously, churning through terabytes of data.”
But FICO scores are just part of the company’s business. Fair Isaac sells software, technology, and services to a wide variety of organizations. The common thread is that it all has to do with predictive analytics. And that is exactly where the central promise of big data lies.
The FICO score is a relatively old and widely familiar example of predictive analytics. Its purpose is to predict for creditors the likelihood that a would-be borrower or client will repay a loan or make good on a bill. But organizations would like to be better at predicting all kinds of things, from consumer preferences to power consumption to the chances that a given credit-card charge is fraudulent. Fair Isaac now helps them do all of those things.
“This is where we start dealing with data in really massive sizes,” Flint says. If, on behalf of a big-box retailer or a packaged-goods company, “you look at their customers’ daily interactions on Facebook and Twitter, or at their web-browsing behavior, that kicks off tremendous amounts of data,” he says.
What do clients want from those torrents of bytes? A retailer might be seeking insights into new or different products to offer a customer. “Knowing that [a shopper] likes a particular brand of breakfast cereal isn’t the end of the game,” Flint says. “You can give them a coupon for that cereal, but they’d buy it anyway . . . . A step beyond is to suggest something they haven’t bought but would likely find appealing.”
For airlines, a big-data question is how to schedule flight crews so that planes on all routes can have the necessary personnel, despite weather delays and other variables, yet return crew members to their hometowns as often as possible “so that they can sleep in their own beds,” Flint says. FICO sells technology to optimize those schedules.
Utility clients want to know how to get oil and gas across a distribution network at minimal costs and at the right times, he says. Another client is a sports-marketing company that creates season schedules for the National Football League. “They need to decide who should play whom each week, who gets Monday nights, and which match-ups would maximize viewership over the course of a season,” Flint says. “You could come up with more than a trillion variations on one season’s NFL schedule.”
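Flint's "more than a trillion variations" is easy to believe once you see how fast orderings multiply. As a back-of-the-envelope illustration (not FICO's method), merely sequencing 15 distinct events already admits 15! arrangements, about 1.3 trillion, before any real NFL constraint is considered:

```python
import math

# Combinatorial explosion in miniature: the number of ways to order
# just 15 distinct events is 15 factorial -- roughly 1.3 trillion.
orderings = math.factorial(15)
print(orderings)           # 1307674368000
print(orderings > 10**12)  # True
```

Optimization software earns its keep by searching such spaces intelligently rather than enumerating them.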
Practically everyone who has studied the problem of escalating health care costs in the United States points to electronic medical records as part of the solution. Data miners, meanwhile, see predictive-analytics opportunities in studying large numbers of digital health records. The catch is that the right kinds of data must be collected in the first place, and the information must reside in a format that is sharable throughout the health care system.
That is a serious challenge, says Prasanna Desikan, senior scientific advisor for the Center for Healthcare Research & Innovation at Allina Health System in Minneapolis. Although the health industry collects “huge amounts of data,” Desikan says, “we still haven’t figured out [the most useful] data to collect.”
From the standpoint of predictive analytics, he says, “This is an interesting place to be right now. The [health care] industry is at an early stage with data collection. The analyses we do today will drive what we need to collect tomorrow. We’ll say, ‘This is working, but this other thing isn’t because we don’t have the data.’”
What kinds of things do data miners want to predict? One of Desikan’s projects is aimed at foreseeing “preventable events” that would send patients to the hospital. “We’re trying to build models that tell us ‘Here are the chances this patient will be admitted in the future for, say, a heart attack,’” he says. Then Allina could call the patient to arrange a checkup, maybe run some tests, and maybe urge some lifestyle changes—this as opposed to waiting for the person to arrive at the hospital in an ambulance.
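A deliberately simplified sketch suggests the shape such a model might take. The features, weights, and threshold below are invented for illustration; they are not Allina's model, which would be trained on real patient records. The structure, though, is the standard one: combine weighted risk factors and map the sum to a probability.

```python
import math

# Hypothetical risk factors and weights, invented for illustration only.
WEIGHTS = {"age_over_65": 1.2, "prior_admissions": 0.8, "smoker": 0.9}
BIAS = -3.0

def admission_risk(patient):
    """Logistic model: map a weighted sum of risk factors to a 0-1 probability."""
    z = BIAS + sum(WEIGHTS[f] * patient.get(f, 0) for f in WEIGHTS)
    return 1 / (1 + math.exp(-z))

low = admission_risk({"age_over_65": 0, "prior_admissions": 0, "smoker": 0})
high = admission_risk({"age_over_65": 1, "prior_admissions": 2, "smoker": 1})
# high > low: more risk factors, higher predicted chance of admission
```

Patients whose score crosses some threshold would be the ones a health system calls in for a checkup before the ambulance ride becomes necessary.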
The primary goal is better treatment, Desikan says. “But of course this has cost implications, too, because it’s much cheaper to treat people before they get to the hospital.”
The City of Minneapolis has 830 police officers and investigators to serve a population of more than 382,000 people. Aside from responding to 911 calls, where and how should those cops be deployed, and what sorts of activities should they engage in to have the greatest impact on crime?
Those are the kinds of questions for which newly appointed chief of police Janee Harteau seeks answers from crime records, arrest reports, and other documents now stored in digital form.
Minneapolis has known for years that about 50 percent of all major crimes occur in 5 percent of the city’s geography, Harteau says. These “hot spots” haven’t changed much in more than a decade. But over the years, as paper record-keeping gave way to electronic storage, the digitized crime database has grown, and analytic methods have improved. The upshot is that the city’s ability to predict where crimes are likely to occur has been refined.
“We used to think in terms of precincts, then sectors, then hot spots,” Harteau says. “Now we think about micro hot spots. We can narrow it down to one city block and say that if we put officers there, we can have an impact on crime.”
Another pattern that has shown up in crime data in Minneapolis and other cities: If you want to use police personnel effectively, don't park an empty squad car in a hot spot and leave it there, and don't send cars through an area on regular patrols. A more effective method, it now appears, is to send a car into the hot-spot area for 15 minutes at random intervals. Also, during these 15-minute drop-ins, officers should get out of the car and talk to people.
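The random-interval idea is simple enough to sketch. The code below, a toy under assumed parameters (an eight-hour shift, four visits), picks non-overlapping 15-minute drop-in times at unpredictable points in the shift; it is an illustration of the scheduling concept, not any department's actual system.

```python
import random

def random_dropins(shift_minutes=480, visit_len=15, visits=4, seed=None):
    """Pick non-overlapping 15-minute visit start times within one shift."""
    rng = random.Random(seed)  # seedable for reproducibility in testing
    starts = []
    while len(starts) < visits:
        start = rng.randrange(0, shift_minutes - visit_len)
        # Reject any start that would overlap an already-scheduled visit.
        if all(abs(start - s) >= visit_len for s in starts):
            starts.append(start)
    return sorted(starts)

schedule = random_dropins(seed=42)
# Four start times, in minutes from the start of the shift; because the
# times are drawn at random, the pattern is hard for offenders to learn.
```

The point of the randomness is precisely that a fixed patrol loop is predictable, and predictable presence is easy to wait out.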
“People will tell you things if you just ask them what’s going on,” Harteau says. She now has the data to back up that assertion.
Jaideep Srivastava is a computer science professor at the University of Minnesota and chief technology officer for Ninja Metrics, Inc., a university spinoff company specializing in social analytics. What that means, he explains, is that he uses data mining and other techniques to analyze big data from social systems. The work “has the potential to transform the way we study psychology, social psychology, and more,” he says.
“Social systems” include the usual suspects, such as Facebook, but Srivastava is particularly interested in multiplayer online games. Blizzard Entertainment’s World of Warcraft, for instance, has more than 9 million paid subscribers. Players spend an average of 3.5 hours a day with the game, he says. That generates big data.
The insights Srivastava seeks, both in his university research and at Ninja Metrics, have to do with ineffable qualities like trust and influence. He says that data from Sony’s online game EverQuest II, for instance, is a gold mine of information about the mechanics of trust, influence, and mentoring relationships.
In EverQuest II, a junior player can ask a senior player to serve as a mentor. The game also allows people to build houses, furnish them, and give other players various levels of permission to use them. Permission also can be withdrawn, step by step. “That gives you visible levels of trust,” Srivastava says. “It means that trust can be quantified and measured.”
The degree of influence that a player has over others in the game can be similarly measured. “It’s all driven by real people with real relationships,” he says.
Commercial applications? Ninja Metrics developed a “customer churn” model for another computer-gaming company that could predict with better than 80 percent accuracy whether a given player would quit the paid-subscription game within the next week.
More significantly, Srivastava says, “We believe we have come up with what I call the atom of social influence.” He hopes to develop models that will reveal the degree of influence that a given person has over others and in which areas the person is influential. Why? Suppose a phone company with 10 million customers wants to offer a new service that it knows will appeal to just 1 percent of those customers. In theory, the company could cut its marketing expenses dramatically by identifying its 100,000 most influential customers and marketing the service to them first. If 1 percent of the influentials are friendly to the offer, Srivastava says, “you’d then contact their neighborhoods and avoid the neighborhoods of the unfriendly influentials.”
The phone company knows who is in whose neighborhood of influence, he says, because mining the data in its customer records tells it who calls whom and how often, who initiates the calls, how long they talk, whether the calls occur during working hours, and more.
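A toy version of that call-record mining might rank customers by how many distinct people they initiate calls to, one crude proxy for the center of a "neighborhood of influence." The call log below is invented sample data; a real model would also weigh call length, timing, and reciprocity, as Srivastava describes.

```python
# Invented (caller, callee) pairs standing in for a phone company's records.
calls = [
    ("ann", "bob"), ("ann", "cara"), ("ann", "dev"),
    ("bob", "ann"), ("cara", "dev"),
]

def top_influencers(call_log, k=1):
    """Rank callers by the number of distinct contacts they initiate calls to."""
    contacts = {}
    for caller, callee in call_log:
        contacts.setdefault(caller, set()).add(callee)
    ranked = sorted(contacts, key=lambda c: len(contacts[c]), reverse=True)
    return ranked[:k]

print(top_influencers(calls))  # ['ann'] -- ann initiates calls to 3 people
```

Scaled to 10 million customers, a ranking like this is what would let the hypothetical phone company market to 100,000 influentials instead of everyone.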
If that raises privacy concerns in your mind, you aren’t alone. The entire subject of big data is riddled with questions about who should have the right to know how much about whom. But that is a story for another day.
Big Data Vs. Useful Data
Plenty of challenges confront organizations that want to extract useful information from big data. Hurdles can include everything from inadequate processing software to a lack of expertise.
One fundamental problem is “accidental data collection,” says Robert Cooley, a longtime data mining expert who now serves as chief technology officer for OptiMine Software, Inc., of St. Paul. The company specializes in price optimization for paid search and other types of digital advertising.
Accidental data collection produces data—sometimes enormous heaps of it—that is useless for problem solving. “Not just hard to work with; I mean useless,” Cooley clarifies. The best way to decide what kinds of information to collect and store, he says, is to start with specific business issues you would like the data to address.
“‘I want to reduce shopping-cart abandonment on my e-commerce site,’ or ‘I want to do a better job of managing my Google ad spend’—those are business problems,” Cooley says. “‘We’re going to do business intelligence’ is not a business problem.”
Accidental data collection has been a headache for many years and certainly predates the term “big data,” Cooley acknowledges. What’s new is the vastly greater time and effort it can take to correct the problem if huge amounts of data are involved.
One time-honored example of accidental data collection arises when online retailers mistakenly use the default setting on some commercial conversion-tracking software packages, then later discover that their reports don’t provide enough detail to shed light on issues such as shopping cart abandonment or online advertising optimization. That can be a serious problem, Cooley says, but it’s one with a 30-second fix.
Today, however, it can take months to set up a system with Apache Hadoop or some other powerful processing capability to analyze big data, Cooley says. “Then, if somebody comes in and says, ‘No, you’re collecting the wrong stuff,’ it can take months to correct.”