Tag: UCSF Health Atlas

  • A Data Science Lesson Inspired by UCSF’s Health Atlas — Part 2

    Teaching Moment: So, What Exactly Is the Lesson Here?

    Now, if you’re an instructor, and especially if you teach statistics, data science, public health, sociology, economics, geography, or honestly any subject where data show up wearing a fake mustache pretending to be “objective truth,” this lesson is for you!

    And the beautiful thing is that students do not need advanced knowledge of machine learning to participate in meaningful data science.

    That point is important.

    Far too many students think data science begins when somebody starts throwing around terms like “deep neural networks,” “transformers,” or “Bayesian hierarchical spatiotemporal latent processes with adaptive priors.”

    Listen.

    Most students are still trying to remember where they saved the CSV file.

    What students REALLY need first is not technical intimidation.

    They need guided curiosity.

    They need to learn how to ask sensible questions, inspect patterns carefully, challenge assumptions, visualize relationships, and understand that data rarely speak clearly the first time you interrogate them.

    This Health Atlas example provides exactly that kind of environment.

    Students can begin with a genuinely important public-health question:

    Why do some communities exhibit substantially higher disability prevalence than others?

    And immediately, the investigation becomes interdisciplinary.

    Now students are talking about:

    • education
    • poverty
    • healthcare access
    • geography
    • regional history
    • public policy
    • and socioeconomic inequality

    That right there is real data science.

    Not because the models are complicated.

    But because the questions matter.

    ________________________________

    Sample Learning Objectives for this Lesson

    By the end of this lesson, students should be able to:

    1. Import, clean, and organize publicly available data
    2. Use visualization as a scientific thinking tool
    3. Distinguish association from causation
    4. Interpret statistical models in plain language
    5. Understand that geography matters
    6. Appreciate the iterative nature of data science

    Import, clean, and organize publicly available data

    Students learn that real-world datasets are rarely neat and perfectly labeled. They must inspect variables, identify missingness, interpret documentation, and construct usable analytic tables.
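
    Here is a minimal sketch of that first pass, using only the Python standard library. Everything in it is an assumption made for illustration: the column names, the four rows, and the choice to treat empty strings and “NA” as missing are invented, and a real Health Atlas export will look different.

```python
import csv
import io

# Hypothetical extract standing in for a downloaded county file.
# Column names and values are invented for illustration only.
RAW = """county_fips,pct_no_hs_diploma,pct_bachelors,pct_disability
01001,11.2,27.5,16.1
01003,9.4,,13.8
01005,25.0,12.1,NA
01007,19.6,10.9,20.3
"""

MISSING = {"", "NA"}  # tokens treated as missing (an assumption)


def load_rows(text):
    """Parse CSV text into a list of dicts, one per county."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows):
    """Count missing entries per column."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            if value.strip() in MISSING:
                counts[name] += 1
    return counts


def complete_cases(rows, numeric_cols):
    """Keep rows where every numeric column is present; convert to float."""
    table = []
    for row in rows:
        if any(row[c].strip() in MISSING for c in numeric_cols):
            continue
        clean = {"county_fips": row["county_fips"]}  # keep the code as text
        clean.update({c: float(row[c]) for c in numeric_cols})
        table.append(clean)
    return table


rows = load_rows(RAW)
numeric_cols = ["pct_no_hs_diploma", "pct_bachelors", "pct_disability"]
missing = count_missing(rows)
analytic = complete_cases(rows, numeric_cols)

print(missing)
print(len(analytic), "complete rows out of", len(rows))
```

    Keeping the county code as text preserves its leading zero. Dropping incomplete rows is only one possible choice, and students should be asked to defend it.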

    Use visualization as a scientific thinking tool

    Students generate histograms, scatterplots, correlation maps, and geographic visualizations to identify patterns before formal modeling begins.

    This is enormously important pedagogically.

    Visualization is not merely decoration.

    Visualization is reasoning.

    Distinguish association from causation

    Students learn that county-level observational relationships do not imply that one variable directly causes another. Instead, the analysis motivates deeper questions and competing explanations.

    Frankly, society could use a LOT more people who understand this distinction.
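
    One way to make the distinction concrete is a small simulation. The sketch below is a thought experiment on synthetic numbers, not an analysis of Health Atlas data: a hidden regional factor drives two variables that never influence each other, yet they end up correlated. The variable names are hypothetical labels.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals_after(y, z):
    """Residuals of y after removing a least-squares line on z."""
    n = len(y)
    mean_z, mean_y = sum(z) / n, sum(y) / n
    slope = (sum((a - mean_z) * (b - mean_y) for a, b in zip(z, y))
             / sum((a - mean_z) ** 2 for a in z))
    return [b - (mean_y + slope * (a - mean_z)) for a, b in zip(z, y)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# Hidden common cause (synthetic): affects both variables below.
regional_factor = [rng.gauss(0, 1) for _ in range(n)]

# Neither variable appears in the other's formula, so neither causes the other.
low_education = [z + rng.gauss(0, 1) for z in regional_factor]
disability = [z + rng.gauss(0, 1) for z in regional_factor]

raw_r = pearson(low_education, disability)

# Correlation once the common cause has been removed from both variables.
partial_r = pearson(residuals_after(low_education, regional_factor),
                    residuals_after(disability, regional_factor))

print(f"raw correlation:       {raw_r:.2f}")
print(f"after removing factor: {partial_r:.2f}")
```

    In this setup the raw correlation is clearly positive and shrinks toward zero once the common cause is accounted for. With real county data the hidden factor is usually not observed, which is exactly why the competing explanations matter.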

    Interpret statistical models in plain language

    Students move beyond simply “running regression” and instead learn to explain what coefficients, predictions, uncertainty, and residuals actually mean in a substantive public-health context.
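
    As a rough sketch of what that translation looks like, the code below fits a one-predictor line to eight invented counties and names each quantity in plain words. The data are made up, and the percentile bootstrap is just one simple way to express uncertainty; it resamples counties as if they were independent, which is optimistic for spatial data.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented values for eight hypothetical counties (not Health Atlas data).
pct_bachelors = [35, 28, 15, 22, 12, 30, 20, 18]
pct_disability = [11, 13, 19, 15, 18, 14, 15, 17]

intercept, slope = fit_line(pct_bachelors, pct_disability)

# Coefficient: average difference in disability prevalence (percentage points)
# between counties that differ by one point in bachelor's attainment.
# Prediction: the fitted value for a county with 25% bachelor's attainment.
predicted_at_25 = intercept + slope * 25

# Residual: observed minus fitted, i.e. what the line fails to explain.
residuals = [obs - (intercept + slope * x)
             for x, obs in zip(pct_bachelors, pct_disability)]

# Uncertainty: percentile bootstrap (resample counties with replacement).
rng = random.Random(0)
boot_slopes = []
for _ in range(2000):
    idx = [rng.randrange(len(pct_bachelors)) for _ in pct_bachelors]
    xs = [pct_bachelors[i] for i in idx]
    ys = [pct_disability[i] for i in idx]
    if len(set(xs)) > 1:  # skip degenerate resamples with a single x value
        boot_slopes.append(fit_line(xs, ys)[1])
boot_slopes.sort()
low = boot_slopes[int(0.025 * len(boot_slopes))]
high = boot_slopes[int(0.975 * len(boot_slopes)) - 1]

print(f"Counties one point higher in bachelor's attainment average "
      f"{abs(slope):.2f} points lower in disability prevalence "
      f"(bootstrap 95% interval for the slope: {low:.2f} to {high:.2f}).")
```

    Note the wording of the printed sentence: it compares counties, and it says nothing about what would happen to a county if its attainment changed.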

    Understand that geography matters

    Spatial clustering teaches students that nearby regions often share historical, economic, environmental, and healthcare structures. Students begin to recognize that spatial dependence violates the independence assumption behind many standard models: the unrealistic fantasy that every observation is unrelated to its neighbors.

    Appreciate the iterative nature of data science

    Perhaps most importantly, students learn that a good analysis rarely concludes with:
    “We solved it.”

    Instead, good analyses end with:
    “Now we know what to ask next.”

    And honestly, that may be one of the healthiest intellectual habits we can teach anybody.

    Because the true spirit of data science is not about worshipping algorithms.

    It is about learning how to think carefully in the presence of uncertainty.

    _________________________________________

    Postscript

    What do you think? How would YOU modify this lesson?

    Would you change the public-health question? Add more variables? Introduce different visualizations or modeling approaches? Expand the spatial component? Simplify the statistical modeling? Push students toward deeper policy discussions?

    And how do the learning objectives sound to you?

    Are they realistic? Too ambitious? Not ambitious enough?

    One of the beautiful things about teaching data science is that there is rarely a single “correct” pathway through the data. Every instructor brings a different perspective, a different intuition, and a different set of experiences into the classroom.

    So, let’s continue the conversation.

    Dr. Stats would truly love to hear your thoughts, suggestions, questions, critiques, and classroom ideas.

  • A Data Science Lesson Inspired by UCSF’s Health Atlas — Part 1

    I am not going to sugarcoat this: I am a statistician, and I believe a good deal of what we call “data science” has to do with statistics and statistical thinking!

    But then again, I have been around for some time, and I have collaborated with enough computer scientists, mathematicians, biologists, neuroscientists, physicians, and education experts to know this: statistics is very, and I mean VERY, useful in tackling big-time data science problems, but its applicability and robustness are greatly amplified when it draws strength from the approaches practiced in CS, the hard sciences, Math Ed, philosophy, and many other disciplines.

    One thing I learned from many years of teaching data analysis in the classroom is that intuition is a magnificently powerful engine.

    The other day I was attending this cool workshop where the magnificent instructors introduced us to this super awesome website: UCSF Health Atlas.

    And let me tell you something: once you enter that site, the teacher in you immediately goes into self-drive mode.

    The UCSF Health Atlas Dashboard

    You start zooming in and out of maps, changing variables, comparing counties, switching color palettes, checking rates, spotting patterns, and before you know it, you have exported half the database onto your laptop.

    And that’s exactly what happened to me.

    I downloaded every file I could get my hands on.

    And then the fun really started.

    I began poking around the files like a raccoon that had just discovered an unattended Costco dumpster.

    Some variables immediately jumped out at me. Educational attainment. Preschool enrollment. Poverty indicators. Disability prevalence.

    Now listen, before we go any further, let me clarify something important.

    Data scientists are professionally trained overthinkers.

    You see one variable and your brain says, “nice.”
    You see two variables and your brain says, “hmmmm.”
    You see six variables and a map, and suddenly you’re Sherlock Holmes wearing cargo shorts and debugging R code at 2:17 in the morning.

    One thing led to another, and before I knew it, I had settled on a simple question:

    Do counties with lower educational opportunity and higher economic vulnerability also tend to show higher disability prevalence?

    Notice something very important here.

    I did NOT begin with a sophisticated statistical model.

    I began with curiosity.

    That, my friends, is a huge part of data science.

    The statistics come later. The machine learning comes later. The fancy words come later.

    But first comes the “I wonder if…”

    And that little sentence is incredibly powerful.

    Disability prevalence versus educational attainment and poverty indicators.

    The first thing I wanted to see was how these variables behaved individually across the country.

    A couple of things immediately became apparent. Disability prevalence across U.S. counties is not uniformly distributed. Neither is educational attainment. Some counties exhibit dramatically higher rates of adults lacking a high-school diploma, while others show substantially higher bachelor’s degree attainment.

    Already, your intuition starts whispering to you.

    The scatterplots and correlation maps reinforced the visual patterns. Counties with higher percentages of adults lacking a high-school diploma tended to exhibit higher disability prevalence. Counties with higher proportions of extremely low-income households also tended to show elevated disability prevalence.

    But the strongest visual relationship appeared to involve bachelor’s degree attainment. The relationship was strikingly negative.

    At this point, however, the maps started telling an even more compelling story.


    And THIS is where data science becomes essential.

    Clusters of elevated disability prevalence appeared throughout portions of Appalachia and the rural South. Similar spatial clustering emerged for low educational attainment and economic vulnerability.

    Now naturally, the statistician in me couldn’t resist fitting a model.

    Actually, several models.

    First, I constructed a simple socioeconomic vulnerability index combining poverty burden, lower educational attainment, and weaker preschool enrollment indicators.

    The relationship was remarkably clear. Counties with greater socioeconomic vulnerability tended to exhibit substantially higher disability prevalence.
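
    The post does not spell out how the index was built, so the following is only one plausible construction, offered as an assumption: standardize each indicator, flip the sign of preschool enrollment so that higher always means more vulnerable, and take an equal-weight average. The five counties and their values are invented.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Invented values for five hypothetical counties (not Health Atlas data).
pct_poverty = [10.0, 14.0, 22.0, 18.0, 26.0]
pct_no_hs_diploma = [8.0, 12.0, 20.0, 15.0, 25.0]
pct_preschool_enrolled = [55.0, 48.0, 35.0, 42.0, 30.0]

z_poverty = zscores(pct_poverty)
z_no_hs = zscores(pct_no_hs_diploma)
# Higher enrollment means LESS vulnerability, so flip the sign.
z_preschool = [-z for z in zscores(pct_preschool_enrolled)]

# Equal-weight average of the three standardized components (an assumption).
vulnerability_index = [mean(parts)
                       for parts in zip(z_poverty, z_no_hs, z_preschool)]

for county, score in enumerate(vulnerability_index):
    print(f"county {county}: {score:+.2f}")
```

    Equal weights are a modeling decision, not a fact about the world. Asking students to try different weights and watch the county rankings move makes a nice classroom exercise.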

    Next, I fitted a standard multiple regression model predicting disability prevalence from poverty and educational indicators. The model explained approximately one-third of the county-level variation in disability prevalence.
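
    To show what “explained variation” means mechanically, here is a small sketch of a two-predictor least-squares fit and its R². The eight counties are invented, so the resulting R² has nothing to do with the one-third figure above; the point is only the arithmetic, namely one minus the residual sum of squares over the total sum of squares.

```python
def centered_sums(u, v):
    """Sum of products of deviations from the mean."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def fit_two_predictors(x1, x2, y):
    """Ordinary least squares for y ~ x1 + x2 via the normal equations."""
    s11, s22 = centered_sums(x1, x1), centered_sums(x2, x2)
    s12 = centered_sums(x1, x2)
    s1y, s2y = centered_sums(x1, y), centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2  # zero only for perfectly collinear predictors
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return b0, b1, b2


# Invented values for eight hypothetical counties (not Health Atlas data).
pct_poverty = [10, 14, 22, 18, 26, 12, 20, 16]
pct_bachelors = [35, 28, 15, 22, 12, 30, 20, 18]
pct_disability = [11, 13, 19, 15, 18, 14, 15, 17]

b0, b1, b2 = fit_two_predictors(pct_poverty, pct_bachelors, pct_disability)

fitted = [b0 + b1 * p + b2 * e for p, e in zip(pct_poverty, pct_bachelors)]
residuals = [obs - fit for obs, fit in zip(pct_disability, fitted)]

ss_res = sum(r ** 2 for r in residuals)                  # unexplained
ss_tot = centered_sums(pct_disability, pct_disability)  # total variation
r_squared = 1 - ss_res / ss_tot

print(f"R-squared on the toy data: {r_squared:.2f}")
```

    An R² of about one-third would mean that roughly two-thirds of the county-to-county variation is left in the residuals, which is where the next question comes from.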

    But then came my favorite part.

    The spatial model.

    The spatial generalized additive model substantially improved predictive performance, explaining over half of the observed variability in disability prevalence.

    And perhaps even more interestingly, the residual maps and Moran’s I statistics still revealed remaining spatial structure.

    Translation?

    Even after accounting for education and poverty, geography was STILL trying to tell us something.

    That is one of the deepest lessons in all of data science.

    Good analyses rarely “finish” a problem.

    Instead, good analyses reveal better questions.

    And that may be the coolest part of the whole thing!
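
    For readers who have not met the Moran’s statistic mentioned above, here is a toy sketch of Moran’s I. The setup is an assumption chosen to keep things tiny: eight hypothetical regions lined up in a row, each neighboring only the regions beside it, with invented residual values. Real county work would use an adjacency matrix built from county boundaries.

```python
def morans_i(values, weights):
    """Moran's I for values observed on regions with a spatial weight matrix."""
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    spread = sum(d * d for d in dev)
    return (n / total_weight) * (cross / spread)


def chain_weights(n):
    """Binary adjacency for n regions in a row: i neighbors i-1 and i+1."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)]
            for i in range(n)]


weights = chain_weights(8)

# Invented residuals: similar values sit next to each other.
clustered = [3.0, 2.5, 2.0, 0.5, -0.5, -2.0, -2.5, -3.0]
# Invented residuals: every region is the opposite of its neighbors.
alternating = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0, 2.0, -2.0]

i_clustered = morans_i(clustered, weights)
i_alternating = morans_i(alternating, weights)

print(f"clustered residuals:   I = {i_clustered:.2f}")
print(f"alternating residuals: I = {i_alternating:.2f}")
```

    Clearly positive values mean neighbors resemble each other, and clearly negative values mean neighbors differ. With no spatial pattern, the expected value is -1/(n-1), which is close to zero when there are many regions. Positive Moran’s I in model residuals is the numerical version of geography still trying to tell us something.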

    Geographic patterns in disability prevalence, education, and poverty.

    Residual geographic structure after spatial modeling.