For decades, technology enthusiasts have labored to impress upon us the perfectly enormous giganticness of the ocean of digital information in which we moderns swim: If the Internet as it existed in 1993 were Windsor Castle, how many Libraries of Congress would fit into just one of its broom closets? And so on.
The data ocean really has gone right on expanding at an exponential rate. And “exponential” is a scary term. IBM says that 90 percent of the data that exists in the world today has been created in just the past two years. We now generate 2.5 quintillion bytes of data every day. (A quintillion is 10 to the 18th power—as if that’s any help in grasping the magnitude.)
This data comes in a variety of forms, including text, audio, video, and still images. It comes from web transactions, traffic and climate sensors, satellites, cell phones, “smart” home appliances, factory equipment, and other machines that communicate with one another.
A staggering amount of data comes from social media alone. In October 2012, Facebook acquired its billionth user. About 250 million photos are uploaded to Facebook every day. Twitter users generate 12 terabytes (a terabyte is 1,000 gigabytes) of tweets per day.
It isn’t unusual for major companies to amass terabytes of data on their own. Some have far more. The Economist reported that Walmart’s databases contained 2.5 petabytes (2,500 terabytes or 2.5 million gigabytes) of data. That was three years ago.
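For readers who want to check the unit arithmetic in the figures quoted so far, here is a minimal sketch. It assumes decimal (SI) units, in line with the article's "a terabyte is 1,000 gigabytes"; the constant and function names are illustrative, not from any source.

```python
# Decimal (SI) storage units, assuming 1 TB = 1,000 GB as the article does.
GB = 10**9
TB = 10**12
PB = 10**15

def convert(n_bytes, unit):
    """Express a byte count in the given unit."""
    return n_bytes / unit

walmart_bytes = 2.5 * PB        # Walmart's databases, per The Economist
daily_bytes = 2.5 * 10**18      # 2.5 quintillion bytes generated per day
tweets_bytes = 12 * TB          # daily volume of tweets

walmart_tb = convert(walmart_bytes, TB)   # petabytes expressed in terabytes
walmart_gb = convert(walmart_bytes, GB)   # petabytes expressed in gigabytes
daily_tb = convert(daily_bytes, TB)       # daily world output in terabytes
tweets_gb = convert(tweets_bytes, GB)     # daily tweets in gigabytes

print(f"Walmart: {walmart_tb:,.0f} TB or {walmart_gb:,.0f} GB")
print(f"World, per day: {daily_tb:,.0f} TB")
print(f"Tweets, per day: {tweets_gb:,.0f} GB")
```

On these assumptions, the 2.5 quintillion bytes generated worldwide each day works out to about a thousand times the size of the Walmart databases.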
The technology industry has settled, for the present, on a rather prosaic term for these mind-boggling amounts of digital stuff: big data. No matter what angle you approach it from, the question with big data is whether and how it can be made useful to people and organizations. Mining big data for patterns of useful information sometimes cannot be done within a reasonable time frame using traditional database software. The task often requires newer distributed-computing frameworks such as Apache Hadoop.
Despite the challenges, interesting findings from big data are beginning to emerge in all sorts of fields: retail, utilities, banking, health care, law enforcement, and more. The possibilities are impressive. A 2011 report by the McKinsey Global Institute suggests that retailers using big data “to the full” could boost their operating margins by more than 60 percent. The U.S. health care industry could use big data to “create more than $300 billion in value every year.”
Even politics is affected. In the wake of President Obama’s 2012 reelection, Time magazine and other sources pointed to big data as a major factor in his campaign’s success. Compiling mountains of previously unstructured data from social media sites and elsewhere, Obama’s number crunchers detected patterns that told them, with unprecedented accuracy, such things as which voting districts to target, which messages would appeal to women and minority voters, which potential donors to approach, how best to approach different types of donors, and how to spend their money most effectively.
Twin Cities organizations are in the thick of the hunt to find useful patterns in big data. Here are a few examples of what they’re doing.
Crunching large amounts of data is nothing new to Fair Isaac Corporation. Based in Arden Hills and better known as FICO, Fair Isaac is one of the companies behind the three-digit scores that determine consumers’ credit ratings.
There are about 315 million people in the United States, and roughly 70 percent of them “have some sort of credit,” says Andy Flint, FICO’s senior director of product management. So in its rating business alone, FICO “deals with hundreds of millions of credit records,” Flint says. “Our computers run continuously, churning through terabytes of data.”
But FICO scores are just part of the company’s business. Fair Isaac sells software, technology, and services to a wide variety of organizations. The common thread is that it all has to do with predictive analytics. And that is exactly where the central promise of big data lies.
The FICO score is a relatively old and widely familiar example of predictive analytics. Its purpose is to predict for creditors the likelihood that a would-be borrower or client will repay a loan or make good on a bill. But organizations would like to be better at predicting all kinds of things, from consumer preferences to power consumption to the chances that a given credit-card charge is fraudulent. Fair Isaac now helps them do all of those things.
“This is where we start dealing with data in really massive sizes,” Flint says. If, on behalf of a big-box retailer or a packaged-goods company, “you look at their customers’ daily interactions on Facebook and Twitter, or at their web-browsing behavior, that kicks off tremendous amounts of data,” he says.
What do clients want from those torrents of bytes? A retailer might be seeking insights into new or different products to offer a customer. “Knowing that [a shopper] likes a particular brand of breakfast cereal isn’t the end of the game,” Flint says. “You can give them a coupon for that cereal, but they’d buy it anyway . . . . A step beyond is to suggest something they haven’t bought but would likely find appealing.”
For airlines, a big-data question is how to schedule flight crews so that planes on all routes can have the necessary personnel, despite weather delays and other variables, yet return crew members to their hometowns as often as possible “so that they can sleep in their own beds,” Flint says. FICO sells technology to optimize those schedules.
Utility clients want to know how to get oil and gas across a distribution network at minimal cost and at the right times, he says. Another client is a sports-marketing company that creates season schedules for the National Football League. “They need to decide who should play whom each week, who gets Monday nights, and which match-ups would maximize viewership over the course of a season,” Flint says. “You could come up with more than a trillion variations on one season’s NFL schedule.”
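To give a feel for how quickly scheduling options multiply, here is a rough sketch that counts only the ways to pair teams into games for a single week. It assumes a 32-team league with every team playing, a figure the article does not state, and it ignores the real constraints (divisions, bye weeks, television slots) that a production scheduler such as FICO's would have to honor. The function name is illustrative.

```python
import math

def count_weekly_pairings(num_teams):
    """Count the ways to split an even number of teams into head-to-head games."""
    if num_teams % 2:
        raise ValueError("need an even number of teams")
    games = num_teams // 2
    # Standard count of perfect matchings: (2n)! / (2^n * n!)
    return math.factorial(num_teams) // (2**games * math.factorial(games))

NFL_TEAMS = 32  # assumption: league size is not given in the article

one_week = count_weekly_pairings(NFL_TEAMS)
print(f"Possible pairings for one week: {one_week:,}")
print("More than a trillion?", one_week > 10**12)
```

Even this single-week count is far beyond a trillion, which suggests why the problem is handed to optimization software rather than worked out by hand.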
Practically everyone who has studied the problem of escalating health care costs in the United States points to the idea of electronic medical records as part of the solution. Data miners also see predictive-analytics opportunities in studying large numbers of digital health records. The catch is that the right kinds of data must be collected in the first place, and the information must reside in a format that is sharable throughout the health care system.
That is a serious challenge, says Prasanna Desikan, senior scientific advisor for the Center for Healthcare Research & Innovation at Allina Health System in Minneapolis. Although the health industry collects “huge amounts of data,” Desikan says, “we still haven’t figured out [the most useful] data to collect.”
From the standpoint of predictive analytics, he says, “This is an interesting place to be right now. The [health care] industry is at an early stage with data collection. The analyses we do today will drive what we need to collect tomorrow. We’ll say, ‘This is working, but this other thing isn’t because we don’t have the data.’”
What kinds of things do data miners want to predict? One of Desikan’s projects is aimed at foreseeing “preventable events” that would send patients to the hospital. “We’re trying to build models that tell us ‘Here are the chances this patient will be admitted in the future for, say, a heart attack,’” he says. Then Allina could call the patient to arrange a checkup, maybe run some tests, and maybe urge some lifestyle changes—this as opposed to waiting for the person to arrive at the hospital in an ambulance.
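As a loose illustration of the kind of model Desikan describes, the sketch below turns a few patient attributes into an estimated chance of admission and flags patients for an outreach call. It is not Allina's method. The risk factors, weights, intercept, and cutoff are all hypothetical, chosen only to show the mechanics of a simple logistic score.

```python
import math

# Hypothetical values for illustration only: not clinical data, not Allina's model.
HYPOTHETICAL_WEIGHTS = {
    "age_over_65": 0.8,
    "smoker": 0.7,
    "prior_admission": 1.2,
}
HYPOTHETICAL_INTERCEPT = -3.0
OUTREACH_THRESHOLD = 0.25  # hypothetical cutoff for scheduling a checkup call

def admission_risk(patient):
    """Return an estimated admission probability between 0 and 1."""
    score = HYPOTHETICAL_INTERCEPT + sum(
        weight * patient.get(factor, 0)
        for factor, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    # The logistic function maps any score onto the 0-to-1 range.
    return 1 / (1 + math.exp(-score))

def needs_outreach(patient):
    """Flag a patient whose estimated risk meets the outreach cutoff."""
    return admission_risk(patient) >= OUTREACH_THRESHOLD

low_risk = {"age_over_65": 0, "smoker": 0, "prior_admission": 0}
high_risk = {"age_over_65": 1, "smoker": 1, "prior_admission": 1}

for label, patient in [("low", low_risk), ("high", high_risk)]:
    print(label, round(admission_risk(patient), 3), needs_outreach(patient))
```

In practice the weights would be estimated from historical records, which is why Desikan stresses that the useful data has to be collected first.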
The primary goal is better treatment, Desikan says. “But of course this has cost implications, too, because it’s much cheaper to treat people before they get to the hospital.”