By Julie Engebretson
It should come as no surprise that the amount of data created by modern society is growing at an incredible rate. The traditional sources of data we’re most familiar with, such as credit card and bank transactions, retail purchases, social media accounts, medical visits and government records, are constantly receiving more and more information about us. And things that used to be entirely data-free –– automobiles, household appliances, vending machines, even the clothes and shoes we wear –– are now equipped to silently collect and transmit data that make even our simplest and most private activities a new source of information ready for analysis.
Just how big is the wave of data washing over us nowadays? Huge –– and getting much bigger. It’s estimated that about 16.3 zettabytes of data –– the equivalent of 16.3 trillion gigabytes (GB) –– are produced in the world each year. Using a cell phone with 256 GB of storage for comparison, that’s enough data to fill more than 59 billion phones every year. Put another way, we’re producing enough data annually for each person on Earth to take about 22 million digital photos on their phone.
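For readers who want to check the math, those comparisons work out in a few lines. (Reading the phone’s “256 GB” in binary gigabytes reproduces the article’s figure; the world-population and per-photo sizes below are our own rough assumptions, not numbers from the article.)

```python
ZETTABYTE = 10**21              # bytes in one zettabyte (decimal)
annual_data = 16.3 * ZETTABYTE  # estimated data produced per year, in bytes

# A "256 GB" phone, interpreted in binary gigabytes (GiB)
phone_storage = 256 * 2**30     # bytes

phones_filled = annual_data / phone_storage
print(f"{phones_filled / 1e9:.1f} billion phones")  # ≈ 59.3 billion

# Photos per person: assumes ~7.6 billion people (2018)
# and a ~100 KB average photo -- both assumptions
people = 7.6e9
photo_size = 100_000  # bytes
photos_each = annual_data / people / photo_size
print(f"about {photos_each / 1e6:.0f} million photos per person")
```

With those assumptions the photo count lands near the article’s “about 22 million”; a slightly smaller assumed photo size matches it exactly.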
And there’s no end in sight. By 2025, the amount of data produced in the world each year is expected to increase 10 times.
These huge sets of collected information are sometimes referred to as “big data,” and faculty members in Baylor’s College of Arts & Sciences are busy looking for ways that such massive data sets can help scientists and researchers solve some of the challenges faced by society.
Enter data science
The growing field of data science studies big data, seeking to take these massive stores of facts and figures and find the patterns or meaning buried within. The job’s complexity requires an interdisciplinary approach, and experts in mathematics, statistical science, computing and engineering must work together to first collect, manage and store these unwieldy volumes of raw data reliably, and then hopefully discover new knowledge from them.
“The data sets we’re talking about tend to be too large to fit on a single computer or to manipulate with traditional statistical methods and databases,” said Dr. Amanda Hering, associate professor of statistical science. “The data also tend to be messy, incomplete and of unusual type and unknown quality. So, machine learning –– also referred to as artificial intelligence –– as well as data mining, databases, statistics and visualization tools are being used to extract information and value from these massive data sets.”
Statisticians such as Hering are applying any number of established methods — or, in some cases, developing new methods as research projects demand them — to identify and then verify the significance of patterns found in these data.
“With small data sets, you worry about being able to find any pattern at all,” Hering said. “And with large data sets, you’re worried about finding patterns that aren’t real, or aren’t relevant. As humans, we’re attuned to patterns as we look for them in the world around us. Sometimes we’ll see patterns that aren’t meaningful, but are based simply on our own experience or anecdotal evidence. And the patterns we see locally don’t always hold true for the world at large. So, when we have large data sets, ‘statistical significance’ does not always imply practical relevance.”
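Hering’s warning about patterns “that aren’t real” is easy to demonstrate with a small simulation (hypothetical, not drawn from any Baylor project): screen enough pure-noise variables against an unrelated target, and a predictable fraction will look “statistically significant” by chance alone.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_vars = 100, 1000

# 1,000 candidate variables of pure noise, plus an unrelated target
X = rng.standard_normal((n_samples, n_vars))
y = rng.standard_normal(n_samples)

# Pearson correlation of each variable with the target
r = (X - X.mean(0)).T @ (y - y.mean()) / (n_samples * X.std(0) * y.std())

# Rough 5% significance cutoff for a correlation at this sample size
cutoff = 1.96 / np.sqrt(n_samples)
false_hits = int((np.abs(r) > cutoff).sum())
print(false_hits)  # roughly 50 of 1,000 noise variables look "significant"
```

About one in twenty noise variables clears the bar, which is exactly why statistical significance in a large data set does not, by itself, imply practical relevance.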
In the Department of Environmental Science in the College of Arts & Sciences, Dr. Cole Matson, an associate professor, is an environmental toxicologist specializing in the genetic effects of contaminants on wildlife. One of his current projects looks at wild Gulf killifish living in the Houston Ship Channel.
“With the Houston Ship Channel, we’re talking about a highly polluted aquatic environment,” Matson said. “We identified that we had populations of killifish living in the ship channel that were highly resistant to some of the industrial chemicals that are found there, so we wanted to understand how they have become resistant, from a mechanistic standpoint. How have they adapted? Genetically, how have these fish been altered to make them better able to survive that pollution?”
Matson said these questions required the sequencing of the entire genome (i.e., an organism’s entire set of DNA) of 288 individual fish from seven different populations — which is just the sort of project where data science shines.
In fact, genome sequencing was among the earliest applications of data science. Launched in 1990, the Human Genome Project aimed to identify the sequence of chemical base pairs that make up human DNA. But mapping the more than three billion nucleotides in a single human reference genome (a representative example of a species’ set of genes) presented immediate challenges. Every genome is unique, so the map had to account for multiple variations of each gene, and working with the overwhelming volume of data produced required expertise in computer science.
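A back-of-envelope calculation shows why the volume overwhelms ordinary tools: a finished genome stores compactly, but the raw sequencing data behind it is far larger. (The coverage depth and per-base sizes below are typical sequencing assumptions, not figures from the article.)

```python
BASES = 3.1e9  # approximate length of the human reference genome

# A finished sequence needs only 2 bits per base (A, C, G, T)
packed_bytes = BASES * 2 / 8
print(f"packed genome: {packed_bytes / 1e9:.2f} GB")  # ≈ 0.78 GB

# Raw sequencing reads are redundant: each position is typically read
# ~30 times ("30x coverage"), and each base carries a quality score,
# so figure roughly 2 bytes per sequenced base before compression
coverage = 30
raw_bytes = BASES * coverage * 2
print(f"raw reads: {raw_bytes / 1e12:.2f} TB per individual")
```

Killifish genomes are smaller than the human genome, but multiply a per-individual figure like this by Matson’s 288 fish and the raw data quickly reaches tens of terabytes.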
After Matson’s success in mapping the genomes of 288 fish, he and his colleagues can now approach the entirety of that data without having to make too many assumptions ahead of time.
“What data science has really allowed us to do is approach projects without the need for any a priori hypotheses (i.e., hypotheses assumed as facts beforehand) about which genetic pathways are going to be important,” he said. “In the past, we could probe gene expression — using tools that have been around for 15 to 20 years — to look at a handful of genes that we knew we wanted to look at going in. We were limited to only probing pathways that we had already identified as likely important. Now, thanks to data science, we don’t have to make those assumptions going in. We’re looking at everything at once. We let the data, really the organism itself, tell us what’s important. We can simply ask the data set, ‘Where are the strongest signals of selection in our adapted populations?’ and, ‘What has changed the fastest in the resistant populations relative to the reference populations?’”
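The article doesn’t name the statistic Matson’s team uses, but a standard way to ask “what has changed fastest between populations?” is to scan every genetic variant for allele-frequency differentiation — for example, with a simple two-population F_ST. A hypothetical sketch:

```python
def fst(p1, p2):
    """Two-population F_ST from the allele frequencies p1 and p2.

    Compares the average heterozygosity within each population (H_S)
    to that of the pooled population (H_T); values near 1 mean the
    populations have strongly diverged at this variant.
    """
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

# Hypothetical variant frequencies: resistant vs. reference population
print(fst(0.95, 0.05))  # strong divergence: a candidate signal of selection
print(fst(0.50, 0.48))  # near zero: ordinary background variation
```

A genome-wide scan simply computes a statistic like this at every variant and lets the outliers — the strongest signals — point to the pathways that matter, with no pathway chosen in advance.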
As valuable as data science is for projects such as Matson’s, involving these uniquely adaptable killifish, he says making use of data is even more critical when researchers leave the lab and “go out into the real world.”
“If I want to understand the toxicity of a single chemical or compound, I can design really targeted experiments because I have a decent idea about the types of toxicity I might expect to see,” Matson said. “When I go out into the real world, I’m not dealing with one chemical compound — I’m dealing with hundreds if not thousands of potentially toxic chemicals that organisms are exposed to. So it’s extremely difficult to predict what types of toxicity we could see.”
Data in the water
Due to factors such as population growth, changes in climate and the high cost of water treatment, water –– for a wide variety of uses –– is becoming an increasingly precious resource in the United States, and not just in drought-stricken California. Many cities and counties across Texas impose water restrictions on residents.
In the Department of Statistical Science in the College of Arts & Sciences, Amanda Hering is part of a project that hopefully will lead to clean water becoming available in more places.
Specifically, the project seeks to decentralize the treatment of wastewater by using smaller and virtually self-sustaining (unstaffed) treatment facilities serving individual communities. Hering is part of a team working on one such decentralized facility near the Colorado School of Mines, where she was on the faculty before coming to Baylor in 2015. The facility is now in its pilot phase and is supported by a grant from the National Science Foundation.
“As a pilot project, the treatment facility is not ‘online’ yet, but it is drawing water from a big apartment complex,” Hering said. “All of the water that goes down the drain, toilet and bathtub funnels to this system, where it is treated. The treatment of wastewater for drinkable re-use gets a bad rap, but oftentimes the wastewater that is cleaned and then released back into the environment is as clean as the water produced by water treatment plants.”
As the water is treated via physical means (e.g., barriers, screens), biological means (e.g., organisms put into the water as part of the treatment process) and chemical means, 30 to 35 different sensors placed at specific points along the treatment cycle collect information about the water quality such as pH and dissolved oxygen levels. Sensors “read” the water every minute, then those data are recorded and charted by a proprietary computer program.
“At one point recently, operators noticed very visibly on the chart that the pH level began to drop,” Hering said. “And it dropped so low that the entire community of biological organisms was killed, shutting down that part of the water treatment process entirely. It took more than two months for the biological community to recover.”
Most interestingly, however, Hering said the operators at the facility only noticed that something had gone wrong by monitoring each variable individually.
“But, the methodology that we developed identified the problem two and a half days before the operators did,” Hering said. “In hindsight, if we can flag serious faults ahead of time, operators will have more time to assess the situation and take steps to correct the fault.”
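The article doesn’t describe Hering’s method, but the gap between watching each gauge separately and monitoring all of them jointly can be illustrated with a classical multivariate statistic such as Hotelling’s T². In this hypothetical example, two correlated water-quality readings each stay inside their individual 3-sigma limits — so no single-variable alarm fires — while their joint behavior is wildly abnormal:

```python
# Two standardized sensor readings (say, pH and dissolved oxygen)
# that normally move together with correlation rho
rho = 0.9
z1, z2 = 2.0, -2.0   # each within +/-3 sigma: no univariate alarm

# Hotelling's T^2 for two variables with known correlation
t2 = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)

limit = 9.21  # chi-square 99th percentile, 2 degrees of freedom
print(f"{t2:.1f}", t2 > limit)  # 80.0 True: the joint alarm fires
```

Because the two readings have broken their usual relationship, the multivariate statistic screams long before either gauge crosses its own limit — the same reason a joint method can flag a fault days before operators watching one chart at a time.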
Ideally, Hering’s work will enable plant operators to catch active or potential faults before they cause two-month shutdowns. Ultimately, as even more data are collected and statistical methods are used to maximum effect, the hope is that decentralized water treatment facilities will operate with increasing reliability on their own — without the need for 24/7 staff.
“As they’re using our method to identify faults, the system sends a text message to their phone — which, here, is recruiting some computer science expertise. I don’t know how to develop an app for a phone, so we needed a computer scientist. The operator can click on the app using their phone wherever they happen to be, see a strongly flagged fault, monitor it and then make decisions about whether they need to go out to the site,” Hering said.
Illuminating data science at Baylor
As data science grows ever more useful –– even necessary –– across multiple academic disciplines, a vision has emerged to establish a data science center or institute at Baylor. A prime supporter of this vision is Dr. Erich Baker, professor of computer science in the School of Engineering and Computer Science. In Baker’s own research, discoveries typically are made where biology, statistics and computer science intersect, making him one of many faculty members whose work outside the classroom depends heavily on innovations in data science.
“In the last five years, it has become obvious that we need to coalesce our data science needs on campus so that we can all work together across all the academic units,” Baker said. “Because the need is in all the schools and colleges, I’ve been trying to get together a critical mass of people on campus to spearhead an initiative that will let us see data science as more of a core on campus — a physical, centralized place where researchers could go. It would be a place where we could combine and leverage all of our computing and data science assets for the benefit of graduate students and faculty –– the people asking the questions.”
Baker isn’t alone in his desire to expand the profile of data science on campus. Baylor University’s new academic strategic plan, Illuminate, includes five multidisciplinary academic initiatives –– one of which centers on data sciences. “Baylor will address the mounting need for dynamic and rapid data analytics that spans virtually all major research emphases on campus,” according to Illuminate. “Data Sciences is the field that can drive all others.”
Illuminate calls for data sciences objectives to initially be focused in three complementary areas –– biomedical informatics, cybersecurity and business analytics, with an overarching theme of ethical uses of large-scale data.
“It is encouraging and exciting that the Baylor Board of Regents recognizes the need in this area and gives us the support to continue to explore what data science at Baylor will look like,” Baker said. “We hope to support data science through strategic hiring across all academic units, enhancement of our scientific computing infrastructure, new data science education programs and through the creation of centers where research faculty can collaborate on issues in data science.”
This article was published in the Fall 2018 issue of Baylor Arts & Sciences magazine.