Biological and Biomedical Data Science (DSAN 6150)
Introduction
Welcome to Biological and Biomedical Data Science, a second-year course in the Georgetown University Master's in Data Analytics program.
Biological and biomedical data are at the heart of many industries, including the life sciences, healthcare, pharmaceuticals, and biotechnology, to name a few. Research and development in these industries aim to create new knowledge about biology (human, animal, or agricultural) and to use that knowledge to create products (medicines, vaccines, seeds, pesticides) that can help their target audiences. Data is central to these purposes but, as with most things, the reality is not quite so simple.
In data science, we have become used to situations where massive amounts of data are available for mining and exploration, typically from tech giants, cloud providers, and fintech. Data in the life sciences is much more limited; a drug can be approved with as few as 1,000 patients in a clinical trial. Data is also harder and more expensive to collect, so we are often much more careful about how we collect and use it. For context, a laboratory experiment may consist of 10 mice, and an initial clinical study might include 15 subjects. Experiments are costly, and so are often limited in size.
The purposes for which we use the data are also quite different. Most data science applications you may encounter revolve around predictive modeling, where the intent is to create robust predictions even if the model is a black box; we don't really care how the predictions are generated, only whether they are accurate and robust. In the life sciences, we're often more interested in which factors influence the prediction, why we see the patterns we see, and whether the patterns can be attributed to a set of factors, either observed or imposed. So the currency we deal in involves correlations, associations, and causality: we're interested in understanding how Nature works.
Leo Breiman, in 2001, wrote a seminal and influential paper, "Statistical Modeling: The Two Cultures", in which he distinguished between data models (classical statistical models) and algorithmic models (what we now call 'black box' predictive models). This is an important distinction to understand, since the two approaches require different mental models, different arrangements of the data, and different methods of interpretation and evaluation.
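To make the distinction concrete, here is a minimal sketch in Python (using NumPy and scikit-learn) that fits both kinds of model to the same data; the simulated "biomarker" setup, sample sizes, and variable names are illustrative assumptions, not course material. The data model is read through its coefficients, while the algorithmic model is judged mainly by its held-out predictive accuracy.

```python
# Minimal sketch contrasting Breiman's two cultures on simulated data.
# The simulated "biomarker" setup below is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                      # three hypothetical biomarkers
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)   # only two truly matter

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data model: we care about the estimated coefficients (how Nature works).
lm = LinearRegression().fit(X_train, y_train)
print("estimated coefficients:", lm.coef_)

# Algorithmic model: we care mainly about predictive accuracy on held-out data.
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("linear model R^2:", lm.score(X_test, y_test))
print("random forest R^2:", rf.score(X_test, y_test))
```

In the data-modeling culture we would ask whether the coefficients recover the generating mechanism; in the algorithmic culture we would simply compare the two held-out R^2 values.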
Perhaps you saw this distinction in the materials for DSAN 5000 and DSAN 5100.
Roadmap for this course
This course is inspired by the widespread reporting of health and medical results and the need for objective assessment of these results to understand their implications and generalizability. There are several questions that are of interest here:
- How can we judge all the reporting around health risks, new cures and treatments, and lifestyle choices?
- What are the challenges in making robust and valid claims?
- How can we advance science through experiments and data in a way that ensures reproducibility and generalizability?
- How does industry approach creating a treatment or drug?
- How can we find reliable factors that affect our health (and not the kind reported on Facebook)?
- How personalized is our knowledge, or how applicable is the available knowledge to our personal situations?
The overriding question that we will revisit throughout the course is: how do we make decisions from data, how do we balance risks, and how do we try to ensure the reproducibility and generalizability of our results and decisions?
Data science topics
Biological and biomedical data present several unique features that need to be addressed in data science, sometimes with specialized methods. Some of the topics that we will cover in this course include:
- Collecting data, and levels of evidence from how the data is collected
- Observational studies, epidemiological studies, randomized studies
- Confounding and causality, direct and indirect effects
- Accommodating imperfect study design to make better inferences and decisions
- The role of randomization and stratification in making better inference
- Missing data, particularly censoring and truncation, and how to accommodate them to make valid estimation and inference
- High-dimensional data (collecting lots of information on relatively few units of observation)
  - Information content
  - ML methods that help (penalized models, gradient boosting, trees)
  - Reproducibility can suffer due to the non-specificity of particular effects
- Type I and type II errors, and how multiple testing affects them (see the short simulation sketch after this list)
- p-values as a necessary but poor decision tool, and how we can do better
- Bayesian approaches to estimation, inference and decision-making
- And the list can go on …
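As a small illustration of the multiple-testing point above, here is a hedged simulation sketch in Python (using NumPy and SciPy); the number of tests, group sizes, and significance threshold are arbitrary choices made for illustration. It estimates the probability of seeing at least one "significant" p-value when every tested hypothesis is truly null, with and without a Bonferroni correction.

```python
# Simulation of how multiple testing inflates the chance of a false positive.
# The number of tests, group sizes, and alpha are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies = 1000   # repeated "studies"
n_tests = 20       # hypotheses tested per study, all of which are truly null
alpha = 0.05

any_fp_uncorrected = 0
any_fp_bonferroni = 0
for _ in range(n_studies):
    # Two groups with no real difference, compared n_tests times.
    pvals = np.array([
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ])
    any_fp_uncorrected += (pvals < alpha).any()
    any_fp_bonferroni += (pvals < alpha / n_tests).any()

print("P(at least one 'significant' result), uncorrected:",
      any_fp_uncorrected / n_studies)
print("P(at least one 'significant' result), Bonferroni:",
      any_fp_bonferroni / n_studies)
```

With 20 independent null tests at alpha = 0.05, the chance of at least one false positive is roughly 1 - 0.95^20, or about 0.64, which the uncorrected simulation should recover; the Bonferroni-corrected rate should stay near 0.05.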