Exploring Patient Satisfaction and Readmission in Medically Underserved Areas

View Project on GitHub


Working in a hospital has given me an appreciation of the complexity of healthcare outcome measures and the diversity of interested parties. For instance, unplanned readmission rate is one important outcome measure because they are very constly and often preventable. For this reason, this metric is often used by entities such as Center for Medicaid Services (CMS), to evaluate a facility.

On the other hand, healthcare is a service that can also be evaluated by patient reported outcome measures (PRAMs). Different parties prioritize some outcome measures over others, and in some cases, an improvement in one measure may be associated with a decline in another.

Dataset

CMS reports extensively on hospital-level readmission rates and patient reported outcome measures (PRAMs). One PRAM is the Consumer Assessment of Healthcare Providers and Systems (CAHPS).

A number of contextual factors also play a role in hospital outcomes. For example, tasks that decrease readmission rate, such as follow-up calls and medication reconciliation, demand resources. Therefore, facilities in medically underserved areas (MUAs) may struggle to achieve high levels of patient satisfaction and decrease readmission rates.

CMS reports on a number of measures, such as physicians per 1000 population, and infant mortality rates, which are used to identify MUAs.

Project goals

In this project, I aimed to explore the relationships between readmission rates and CAHPS outcomes in MUAs and non-MUAs. I hoped to answer these questions:

  • Does readmission rate and patient satisfaction have a positive correlation, or must we make tradeoffs in improvements?
  • Are measures used to determine MUA status related to a facility's outcome measures?
  • Should investments aimed at improving outcomes target MUA measures?

Summary of findings

  • My exploration showed that while minor, there is a negative association between measures of patient satisfaction and readmission rates (as I expected). This suggests that hospital administrators shouldn't necessarily be concerned that measures to decrease readmission rates would decrease patient satisfaction.
  • Additionally, the relationship between MUA status and measures is minimal. It appears that the underlying causes of these factors may be different. An important note is that these measures only cover a subset of the ways in which patients are medically underserved. Some areas with higher Index of Medical Underservice (IMU) scores may be underserving certain specific populations (such as those with low-income, or non-English speakers).

Full project

Getting data

I queried Center for Medicare/Medicaid Services' API to gather hospital-level data on readmission rates and patient satisfaction (via responses to CAHPS survey):

In []:
import pandas as pd
from pprint import pprint
import requests

# from private.py
API_TOKEN=''
API_SECRET=''
auth = (API_TOKEN, API_SECRET)

# the two API queries with an initialized empty dataframe for each
queries = [
    ["https://data.medicare.gov/resource/qk5r-kh7u.json?$limit=5000", pd.DataFrame()], 
    ["https://data.medicare.gov/resource/r32h-32z5.json?$limit=5000&$where=measure_name= 'Rate of readmission after discharge from hospital (hospital-wide)' AND score != 'Not Available'", pd.DataFrame()]]

# use each query and save results in the dataframe
for query in queries:
    if len(API_TOKEN)>0:
        r=requests.get(query[0], auth=auth).json()
        query[1] = pd.DataFrame.from_records(r)
    else:
        r=requests.get(query[0]).json()
        query[1] = pd.DataFrame.from_records(r)

# separate CAHPS and readmission datasets
cahps_df = queries[0][1]
readm_df = queries[1][1]

# import MUA dataset
mua_df= pd.read_csv("https://data.hrsa.gov//DataDownload/DD_Files/MUA_DET.csv")

print("The CAHPS DataFrame has " + str(cahps_df.shape[0]) + " rows and " + str(cahps_df.shape[1]) + " columns. \n\nThe Readmission DataFrame has " + str(readm_df.shape[0]) + " rows and " + str(readm_df.shape[1]) + " columns. \n\nThe Medically Underserved Area DataFrame has " + str(mua_df.shape[0]) + " rows and " + str(mua_df.shape[1]) + " columns.\n\nTotal number of columns:", cahps_df.shape[0] + readm_df.shape[0] + mua_df.shape[0])
Out[]:
The CAHPS DataFrame has 1240 rows and 30 columns. 

The Readmission DataFrame has 4385 rows and 23 columns. 

The Medically Underserved Area DataFrame has 19619 rows and 75 columns.

Total number of columns: 25244

Exploring the CAHPS and readmission datasets

I wrote some functions to compile column names and counts of variables in each column from a dataframe into callable lists, for example:

In []:
import cleaning.explore as explore
print("Facility names (first 15):\n")
pprint(explore.get_counts(cahps_df)[1][0:15])
print("\nColumns in the readmissions data frame:\n")
for col in explore.get_cols(readm_df):
      print(col)
Out[]:
Facility names (first 15):

MEMORIAL HOSPITAL                       4
WAYNE MEMORIAL HOSPITAL                 3
ST JOSEPH'S HOSPITAL                    2
SAINT FRANCIS MEDICAL CENTER            2
ST MARY MEDICAL CENTER                  2
COLUMBUS COMMUNITY HOSPITAL             2
O'CONNOR HOSPITAL                       2
ST LUKES HOSPITAL                       2
ST JOSEPH HOSPITAL                      2
HIGHLAND HOSPITAL                       2
HARDIN MEMORIAL HOSPITAL                2
ST JOSEPH MEDICAL CENTER                2
SOUTHEASTERN REGIONAL MEDICAL CENTER    2
GOOD SAMARITAN HOSPITAL                 2
OKLAHOMA SPINE HOSPITAL                 1
Name: facility_name, dtype: int64

Columns in the readmissions data frame:

provider_id
hospital_name
address
city
state
zip_code
county_name
phone_number
measure_id
measure_name
compared_to_national
denominator
score
lower_estimate
higher_estimate
measure_start_date
measure_end_date
location
location_address
location_city
location_state
location_zip
footnote

Notes from exploring datasets:

  • CAHPS
    • state is abbreviation
    • city, county, address, and facility name are in all caps
    • there are several hospital names that appear more than once
  • Readmission
    • there's a column called location_state and one called state.
    • Zip is called zip_code
    • state is abbreviation
    • city, county, address, and facility name are in all caps
    • there are several hospital names that appear more than once

Cleaning up these datasets

Starting with the CAHPS and Readmission datasets

First, I removed and renamed uninteresting columns from these two DataFrames

In []:
cahps_df_clean = cahps_df.drop(
    columns = [
        'cms_certification_number',
        'measure_start_date', 
        'measure_end_date', 
        'footnote', 'telephone'
        ]).rename(
            columns = {
                "county":"county_name"})

readm_df_clean = readm_df.drop(
    columns = [
        'footnote',
        'measure_id', 
        'measure_name',
        'phone_number',
        'location_address',
        'location_city',
        'location_state',
        'location_zip'
        ]).rename(
            columns = {
                'hospital_name':'facility_name',
                'compared_to_national':'rate_of_readmission_national_comparison',
                'score':'rate_of_readmission', 
                'denominator':'rate_denominator',
                'higher_estimate':'rate_of_readmission_higher_estimate',
                'lower_estimate':'rate_of_readmission_lower_estimate',
                'zip_code':'zip'})

Now for the Medically Underserved Area (MUA) dataset

This is a more involved process...

  • MUA status is determined by an IMU (Index of Medical Underservice) score of 62 or less. Accoring to HRSA, the IMU involves four variables
    • ratio of primary medical care physicians per 1,000 population,
    • infant mortality rate,
    • percentage of the population with incomes below the poverty level, and
    • percentage of the population age 65 or over

There are a LOT of variables in this dataset:

In []:
for col in explore.get_cols(mua_df):
    print(col)
Out[]:
Columns in the MUA data frame: 

MUA/P ID
MUA/P Area Code
Minor Civil Division FIPS Code
Minor Civil Division Census Code
Minor Civil Division Name
Designation Type Code
Designation Type
MUA/P Status Code
MUA/P Status Description
Census Tract
Designation Date
Designation Date.1
IMU Score
MUA/P Service Area Name
MUA/P Update Date
MUA/P Update Date.1
U.S. - Mexico Border 100 Kilometer Indicator
U.S. - Mexico Border County Indicator
State and County Federal Information Processing Standard Code
County or County Equivalent Federal Information Processing Standard Code
Complete County Name
County Equivalent Name
County Description
HHS Region Code
HHS Region Name
State FIPS Code
State Name
State Abbreviation
Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level
Percentage of Population Age 65 and Over
Infant Mortality Rate
Ratio of Providers per 1000 Population
Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level Index of Medical Underservice Score
Percentage of Population Age 65 and Over IMU Score
Infant Mortality Rate IMU Score
Ratio of Providers per 1000 Population IMU Score
Data Warehouse Record Create Date
Data Warehouse Record Create Date Text
Primary State FIPS Code
HPSA Primary State Abbreviation
Primary State Name
Primary HHS Region Code
Primary HHS Region Name
Common Region Code
Common Region Name
Common State Name
Common State Abbreviation
Common State FIPS Code
Common State County FIPS Code
Common County Name
Break in Designation
Medically Underserved Area/Population (MUA/P) Component Designation Date
Medically Underserved Area/Population (MUA/P) Component Designation Date.1
Medically Underserved Area/Population (MUA/P) Component Geographic Name
Medically Underserved Area/Population (MUA/P) Component Geographic Type Code
Medically Underserved Area/Population (MUA/P) Component Geographic Type Description
Medically Underserved Area Geography Type Surrogate Key
Medically Underserved Area/Population (MUA/P) Component Last Update Date
Medically Underserved Area/Population (MUA/P) Component Status Code
Medically Underserved Area/Population (MUA/P) Component Status Description
Medically Underserved Area/Population (MUA/P) Component Update Date
Designation Population in a Medically Underserved Area/Population (MUA/P)
Medically Underserved Area/Population (MUA/P) Metropolitan Indicator
Medically Underserved Area/Population (MUA/P) Metropolitan Description
Medically Underserved Area/Population (MUA/P) Metropolitan Indicator.1
Medically Underserved Area/Population (MUA/P) Population Type Code
Medically Underserved Area/Population (MUA/P) Population Type
Medically Underserved Area/Population (MUA/P) Population Type ID
Medically Underserved Area/Population (MUA/P) Total Resident Civilian Population
Medically Underserved Area/Population (MUA/P) Withdrawal Date
Medically Underserved Area/Population (MUA/P) Withdrawal Date in Text Format
Rural Status Code
Rural Status Description
Providers per 1000 Population
Unnamed: 74

1. I started by removing the majority of variables not related to these measure or MUA status because I was interested in simply getting a record of which areas are considered underserved.

In []:
mua_df_clean = mua_df.drop(
    columns = ['MUA/P ID', 'Common County Name', 'Complete County Name', 'MUA/P Service Area Name', 'County Description','State Name', 'U.S. - Mexico Border 100 Kilometer Indicator',  'U.S. - Mexico Border County Indicator', 'Medically Underserved Area/Population (MUA/P) Metropolitan Indicator.1', 'Medically Underserved Area/Population (MUA/P) Metropolitan Description',  'Medically Underserved Area/Population (MUA/P) Metropolitan Indicator',  'Rural Status Code',  'Designation Type Code', 
        'Minor Civil Division FIPS Code',  'Minor Civil Division Census Code', 'Minor Civil Division Name',  'MUA/P Status Code',  'Census Tract',  'Designation Date', 'Designation Date.1',  'State and County Federal Information Processing Standard Code',  'County or County Equivalent Federal Information Processing Standard Code', 'HHS Region Code','HHS Region Name',  'State FIPS Code',  'Data Warehouse Record Create Date', 'Data Warehouse Record Create Date Text',
        'Primary State FIPS Code','HPSA Primary State Abbreviation', 'Primary State Name', 'Primary HHS Region Code', 'Primary HHS Region Name','Common Region Code','Common Region Name', 'Common State Name','Common State Abbreviation','Common State FIPS Code','Common State County FIPS Code',  'Break in Designation', 'Medically Underserved Area/Population (MUA/P) Component Designation Date','Ratio of Providers per 1000 Population','Medically Underserved Area/Population (MUA/P) Component Designation Date.1',  
        'Medically Underserved Area/Population (MUA/P) Component Geographic Name','Medically Underserved Area/Population (MUA/P) Component Geographic Type Code','Medically Underserved Area/Population (MUA/P) Component Geographic Type Description', 'Medically Underserved Area Geography Type Surrogate Key', 'Medically Underserved Area/Population (MUA/P) Component Last Update Date','Medically Underserved Area/Population (MUA/P) Component Status Code','Medically Underserved Area/Population (MUA/P) Component Status Description',
        'Medically Underserved Area/Population (MUA/P) Component Update Date','Designation Population in a Medically Underserved Area/Population (MUA/P)', 'Medically Underserved Area/Population (MUA/P) Population Type ID','Medically Underserved Area/Population (MUA/P) Population Type Code','Medically Underserved Area/Population (MUA/P) Total Resident Civilian Population','Medically Underserved Area/Population (MUA/P) Withdrawal Date','Medically Underserved Area/Population (MUA/P) Withdrawal Date in Text Format', 'Unnamed: 74'
        ]).rename(columns={ 'Designation Type':'is-MUA?', 'MUA/P Status Description':'MUA-status-desc',  'County Equivalent Name' : 'county_name', 'State Abbreviation' : 'state','Medically Underserved Area/Population (MUA/P) Population Type': 'MUA-population-type'})

2. Next, I made the "is-MUA?" column a yes/no category

In []:
def binarize_categorical_variable(df, column, yescat):

    """ Turns a categorical variable in a DataFrame into a "yes" or "no" response
    arguments: 
    df = pandas DataFrame, 
    column = quoted column name
    yes = quoted *and bracketed* list of category names that you wish to turn to "yes" 
    returns: the original pandas df with new binarized variable"""
    
    df[column]=df[column].astype('category')
    
    cat_list=[] 

    for cat in df[column].cat.categories:
        cat_list.append(cat)

    repl_dict = {}

    for cat in cat_list:
        for yes in yescat:
            if cat == yes:
                repl_dict[cat]= "Yes"   
        if repl_dict.get(cat) == None:
            repl_dict[cat]="No"

    df[column] = df[column].replace(repl_dict)

    return df

mua_df_clean=binarize_categorical_variable(mua_df_clean, 'is-MUA?', ['Medically Underserved Area'])
In []:
# Change the county name to be all uppercase to match the other DataFrames
mua_df_clean['county_name'] = mua_df_clean.county_name.str.upper()
In []:
mua_df_clean.head()
Out[]:
MUA/P Area Code is-MUA? MUA-status-desc IMU Score MUA/P Update Date MUA/P Update Date.1 county_name state Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level Percentage of Population Age 65 and Over Infant Mortality Rate Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level Index of Medical Underservice Score Percentage of Population Age 65 and Over IMU Score Infant Mortality Rate IMU Score Ratio of Providers per 1000 Population IMU Score MUA-population-type Rural Status Description Providers per 1000 Population
0 70004 Yes Designated 42.5 06/12/2017 2017/06/12 AIRAI PW 59.2 7.3 12.7 0.0 20.1 22.4 0.0 Medically Underserved Area Rural 0.0
1 70228 Yes Designated 42.5 06/12/2017 2017/06/12 NGIWAL PW 59.2 7.3 12.7 0.0 20.1 22.4 0.0 Medically Underserved Area Rural 0.0
2 70350 Yes Designated 42.5 06/12/2017 2017/06/12 PELELIU PW 59.2 7.3 12.7 0.0 20.1 22.4 0.0 Medically Underserved Area Rural 0.0
3 70370 Yes Designated 42.5 06/12/2017 2017/06/12 SONSOROL PW 59.2 7.3 12.7 0.0 20.1 22.4 0.0 Medically Underserved Area Rural 0.0
4 70010 Yes Designated 42.5 06/12/2017 2017/06/12 ANGAUR PW 59.2 7.3 12.7 0.0 20.1 22.4 0.0 Medically Underserved Area Rural 0.0

Merging the datasets

At this point, I combined the CAHPS, readmission, and MUA datasets

In []:
def double_merge(df1, df2, df3, merge1, merge2):
    """ A function to merge 3 datasets using a left merge
    
    Arguments: 
    df1 =  pandas dataframe of the limiting dataset / want to keep all the columns
    df2, df3 = the other two pandas dataframes 
    merge1 = a list of columns to merge on for the first merge
    merge2 = a list of columns to merge on for the 2nd merge 
    
    Returns: a pandas dataframe of mergerd data
    """

    first_merge = df1.merge(df2, how = 'left', on = merge1)

    second_merge = first_merge.merge(df3, how = 'left', on = merge2)

    return second_merge

merge1 = ['facility_name','address','city','county_name','state','zip'] #the columns on which to merge cahps_df_clean and readm_df_clean
merge2 = ['state', 'county_name'] #the column on which to merge mua_df_clean with the other two

merged_dff = double_merge(cahps_df_clean, readm_df_clean, mua_df_clean, merge1, merge2).drop_duplicates(subset="address")

Completing the cleaning process

There are a couple more things to do to this merged dataset:

  • filter out NAs in the variables of interest (readmission rate and patient ratings)
  • fill the rest of the NAs or missing values with something meaningful
    • Some NAs mean the hospital is not listed in the MUA dataset (is not in an MUA)
    • Some mean that 0 people provided a given response for a patient satisfaction question
  • make sure each variable is the correct datatype
In []:
def cleanup_merged_df(df):
    """ Function to clean up the merged data
    Argument: the merged data as pandas DataFrame
    Output: Cleaned up merged data as pandas DataFrame"""

    ## This filters out rows where the mean facility rating and rate of readmission are null
    df = df[df.patients_rating_of_the_facility_linear_mean_score.notnull()]
    df = df[df.rate_of_readmission.notnull()]

    ## Mark rows that existed in the outcome measure (OC) DataFrame but not in the MUA DataFrame as not being MUAs
    df['is-MUA?'] = df['is-MUA?'].fillna(value="No")
    df['MUA-status-desc'] = df['MUA-status-desc'].fillna(value="Not Designated")
    df['MUA-population-type'] = df['MUA-population-type'].fillna(value="None")
    df['Rural Status Description'] = df['Rural Status Description'].fillna(value="No Info")

    # Fill missing count values with 0 (0 respondants)
    fill_list=[
        'number_of_completed_surveys',
        'number_of_sampled_patients',
        'patients_rating_of_the_facility_linear_mean_score',
        'patients_who_gave_the_facility_a_rating_of_0_to_6_on_a_scale_from_0_lowest_to_10_highest',
        'patients_who_gave_the_facility_a_rating_of_7_or_8_on_a_scale_from_0_lowest_to_10_highest',
        'patients_who_gave_the_facility_a_rating_of_9_or_10_on_a_scale_from_0_lowest_to_10_highest',
        'patients_who_reported_no_they_would_not_recommend_the_facility_to_family_or_friends',
        'patients_who_reported_probably_yes_they_would_recommend_the_facility_to_family_or_friends',
        'patients_who_reported_that_staff_definitely_communicated_about_what_to_expect_during_and_after_the_procedure',
        'patients_who_reported_that_staff_definitely_gave_care_in_a_professional_way_and_the_facility_was_clean',
        'patients_who_reported_that_staff_did_not_communicate_about_what_to_expect_during_and_after_the_procedure',
        'patients_who_reported_that_staff_did_not_give_care_in_a_professional_way_or_the_facility_was_not_clean',
        'patients_who_reported_that_staff_somewhat_communicated_about_what_to_expect_during_and_after_the_procedure',
        'patients_who_reported_that_staff_somewhat_gave_care_in_a_professional_way_or_the_facility_was_somewhat_clean',
        'patients_who_reported_yes_they_would_definitely_recommend_the_facility_to_family_or_friends']

    for col in fill_list:
        df.loc[:,col] = df.loc[:,col].fillna(value=0)


#     ## List columns to be integers, floats, and categories
#     int_list = ['patients_rating_of_the_facility_linear_mean_score', 'communication_about_your_procedure_linear_mean_score','facilities_and_staff_linear_mean_score', 'number_of_completed_surveys','number_of_sampled_patients', 'patients_recommending_the_facility_linear_mean_score','patients_who_gave_the_facility_a_rating_of_0_to_6_on_a_scale_from_0_lowest_to_10_highest', 'patients_who_gave_the_facility_a_rating_of_7_or_8_on_a_scale_from_0_lowest_to_10_highest', 'patients_who_gave_the_facility_a_rating_of_9_or_10_on_a_scale_from_0_lowest_to_10_highest','patients_who_reported_no_they_would_not_recommend_the_facility_to_family_or_friends','patients_who_reported_probably_yes_they_would_recommend_the_facility_to_family_or_friends','patients_who_reported_that_staff_definitely_communicated_about_what_to_expect_during_and_after_the_procedure', 'patients_who_reported_that_staff_definitely_gave_care_in_a_professional_way_and_the_facility_was_clean', 'patients_who_reported_that_staff_did_not_communicate_about_what_to_expect_during_and_after_the_procedure', 'patients_who_reported_that_staff_did_not_give_care_in_a_professional_way_or_the_facility_was_not_clean','patients_who_reported_that_staff_somewhat_communicated_about_what_to_expect_during_and_after_the_procedure','patients_who_reported_that_staff_somewhat_gave_care_in_a_professional_way_or_the_facility_was_somewhat_clean','patients_who_reported_yes_they_would_definitely_recommend_the_facility_to_family_or_friends', 'survey_response_rate_percent', 'rate_denominator', 'provider_id']
#     float_list = ['rate_of_readmission_higher_estimate', 'rate_of_readmission_lower_estimate', 'rate_of_readmission']
#     category_list = ['county_name','state', 'is-MUA?', 'MUA-status-desc', 'MUA-population-type', 'Rural Status Description']

#     ## Change data types as above
#     for col in int_list:
#         df[col] = df[col].astype('int')

#     for col in float_list:
#         df[col] = df[col].astype('float')

#     for col in category_list:
#         df[col] = df[col].astype('category')

    return df 

merged_df=cleanup_merged_df(merged_dff).reset_index().drop(columns='index')

Variables in the complete dataset

In []:
print("The merged data set has", merged_df.shape[0], "rows and", merged_df.shape[1], "columns.\n\nColumn names: \n")

for col in explore.get_cols(merged_df):
    print(col)
Out[]:
The merged data set has 814 rows and 50 columns.

Column names: 

address
city
communication_about_your_procedure_linear_mean_score
county_name
facilities_and_staff_linear_mean_score
facility_name
number_of_completed_surveys
number_of_sampled_patients
patients_rating_of_the_facility_linear_mean_score
patients_recommending_the_facility_linear_mean_score
patients_who_gave_the_facility_a_rating_of_0_to_6_on_a_scale_from_0_lowest_to_10_highest
patients_who_gave_the_facility_a_rating_of_7_or_8_on_a_scale_from_0_lowest_to_10_highest
patients_who_gave_the_facility_a_rating_of_9_or_10_on_a_scale_from_0_lowest_to_10_highest
patients_who_reported_no_they_would_not_recommend_the_facility_to_family_or_friends
patients_who_reported_probably_yes_they_would_recommend_the_facility_to_family_or_friends
patients_who_reported_that_staff_definitely_communicated_about_what_to_expect_during_and_after_the_procedure
patients_who_reported_that_staff_definitely_gave_care_in_a_professional_way_and_the_facility_was_clean
patients_who_reported_that_staff_did_not_communicate_about_what_to_expect_during_and_after_the_procedure
patients_who_reported_that_staff_did_not_give_care_in_a_professional_way_or_the_facility_was_not_clean
patients_who_reported_that_staff_somewhat_communicated_about_what_to_expect_during_and_after_the_procedure
patients_who_reported_that_staff_somewhat_gave_care_in_a_professional_way_or_the_facility_was_somewhat_clean
patients_who_reported_yes_they_would_definitely_recommend_the_facility_to_family_or_friends
state
survey_response_rate_percent
zip
rate_of_readmission_national_comparison
rate_denominator
rate_of_readmission_higher_estimate
location
rate_of_readmission_lower_estimate
measure_end_date
measure_start_date
provider_id
rate_of_readmission
MUA/P Area Code
is-MUA?
MUA-status-desc
IMU Score
MUA/P Update Date
MUA/P Update Date.1
Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level
Percentage of Population Age 65 and Over
Infant Mortality Rate
Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level Index of Medical Underservice Score
Percentage of Population Age 65 and Over IMU Score
Infant Mortality Rate IMU Score
Ratio of Providers per 1000 Population IMU Score
MUA-population-type
Rural Status Description
Providers per 1000 Population

Finally time to plot!

1. Plotting the distribution of the responses to the CAHPS survey questions

I was interested in looking at the reponses to 4 questions:

  • did the staff communicate about what to expect during and after the procedure?
  • would you recommend the facility to family or friends?
  • did staff give care in a professional way and was the facility clean?
  • what was your overall rating of the favility from 0 being the lowest to 10 being the highest?

The survey respondants provided Likert-type responses to each question. This dataset gives counts of how many respondant provided each type of response.

I turned those counts into proportions of total respondants in order to look at the distribution of responses for the four questions above:

In []:
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()

def plotly_resp_by_state(df, columns, names, graphTitle):
    """ A function to plot the proportion of patients who gave each response to a survey question
    
    Arguments:
    columns = the name of the variable columns as a list of strings 
    names = the to-be names of the variable columns as a list of strings 
    graphTitle = title of the graph as a string
    
    Ouput: plotly plot"""
    
    #group by state
    grp_df=df.groupby(by='state').sum().reset_index()

    # create dict of current and to-be column names 
    name_dict={}

    for i, col in enumerate(columns):
        name_dict[col]=names[i]
    
    # add the following two columns to column list
    col_list=['state','number_of_completed_surveys']
    col_list.extend(columns)

    # create new df from grp_df and rename columns
    df=grp_df[col_list].rename(columns=name_dict)

    # change measure from count to proportion of total surveyed patients
    l=len(col_list)
    df.iloc[:,2:l]=df.values[:,2:l]/df.values[:,1,None]

    # restructure DataFrame
    df=df.melt(id_vars='state', value_vars=names, var_name='rating', value_name='proportion of patients')

    # create plot
    fig=px.strip(df, 
                 x='rating',
                 y='proportion of patients', 
                 orientation = "v", 
                 hover_data= ['state'])
    fig.update_layout(title_text=graphTitle, 
                      title_font_size=12, 
                      template="plotly_white", 
                      autosize=False,
                      width=400,
                      height=400)
    fig.update_yaxes(title_text="proportion of patients", title_font_size=12)
    fig.show()



# create a list of column lists, a list of name lists, and a list of graph ttiles for 4 measures
columns=[
    ['patients_who_reported_that_staff_did_not_communicate_about_what_to_expect_during_and_after_the_procedure','patients_who_reported_that_staff_somewhat_communicated_about_what_to_expect_during_and_after_the_procedure', 'patients_who_reported_that_staff_definitely_communicated_about_what_to_expect_during_and_after_the_procedure'],
    ['patients_who_reported_no_they_would_not_recommend_the_facility_to_family_or_friends','patients_who_reported_probably_yes_they_would_recommend_the_facility_to_family_or_friends','patients_who_reported_yes_they_would_definitely_recommend_the_facility_to_family_or_friends'],
    ['patients_who_reported_that_staff_did_not_give_care_in_a_professional_way_or_the_facility_was_not_clean','patients_who_reported_that_staff_somewhat_gave_care_in_a_professional_way_or_the_facility_was_somewhat_clean','patients_who_reported_that_staff_definitely_gave_care_in_a_professional_way_and_the_facility_was_clean'],
    ['patients_who_gave_the_facility_a_rating_of_0_to_6_on_a_scale_from_0_lowest_to_10_highest','patients_who_gave_the_facility_a_rating_of_7_or_8_on_a_scale_from_0_lowest_to_10_highest','patients_who_gave_the_facility_a_rating_of_9_or_10_on_a_scale_from_0_lowest_to_10_highest']]
names_list=[
    ['not at all', 'somewhat', 'definitely'],
    ['no','probably', 'yes definitely'],
    ['not at all', 'somewhat', 'definitely'],
    ['0-6','7-8','9-10']]
graphTitle_list=["Patient Ratings of Staff Communication about the Procedure", "Patient Ratings of If They Would Recommend the Facility", "Patient Ratings of Professional Care and Facility Cleanliness","Overall Rating of the Facility (scale from 0 lowest to 10 highest)"]



# plot each of 4 measures
for i, o in enumerate(columns):
    columns=o
    names=names_list[i]
    graphTitle=graphTitle_list[i]
    plotly_resp_by_state(merged_df, columns, names, graphTitle)
Out[]:
Out[]:
Out[]:
Out[]:
Out[]:

Hover over each point to see the state.

I was surprised to see that overall, the distribution of patient ratings is skewed towards positive reponses when asked about specific aspects of their care. This surprised me because a lot of the narrative around healthcare in the U.S. is about the inconvenience and cost. However, it appears that consistently, less than 10% of respondant gave negative ratings.

However, healthcare professionals and administrators are always trying to improve patient experience. What are some potential strategies for doing this?

Possible candidates include the practices that have been shown to decrease readmission rates, such as:

  • medication reconciliation
  • follow-up phone calls
  • handoff to PCPs promptly after surgery

However, sometimes, practices that have a positive effect on one outcome measure may not improve another. For instance, swift handoff can make patients feel like they aren't being given attention and thereby decrease patient satisfaction. For this reason, I wanted to explore the question of whether increases in patient satisfaction are associated with decreased readmission rates.

2. Plotting the relationship between readmission rate and patient satisfaction

The CAHPS dataset provides linear means for patient ratings of communication about the procedures, the overall rating of the facility, and liklihood to recommend the facility.

I started by checking out the distribution of these variables using a seaborn pairplot:

In []:
import matplotlib.pyplot as plt
import seaborn as sns

#plot
d = sns.pairplot(merged_df, height = 4, vars=[
    'patients_rating_of_the_facility_linear_mean_score', 
    'patients_recommending_the_facility_linear_mean_score',
    'facilities_and_staff_linear_mean_score', 
    'communication_about_your_procedure_linear_mean_score'], )

#change axis labels
replacements = {
    'communication_about_your_procedure_linear_mean_score': 'communication', 
    'patients_recommending_the_facility_linear_mean_score': 'recommendation', 
    'facilities_and_staff_linear_mean_score' : 'facilities+staff',
    'patients_rating_of_the_facility_linear_mean_score': 'overall rating'}
for i in range(3):
    for j in range(3):
        xlabel = d.axes[i][j].get_xlabel()
        ylabel = d.axes[i][j].get_ylabel()
        if xlabel in replacements.keys():
            d.axes[i][j].set_xlabel(replacements[xlabel])
        if ylabel in replacements.keys():
            d.axes[i][j].set_ylabel(replacements[ylabel])
Out[]:

All of the linear means of the responses are mostly normally distributed and positively correlated with each other

Next, I checked the correlations between these four patient responses and readmission rate.

As the code below shows, all of the four patient reported measures above had weak negative correlations with readmission rates:

In []:
corr = merged_df.corr()['rate_of_readmission']
print("Correlation with readmission rate:")
corr.loc[(
    'patients_recommending_the_facility_linear_mean_score',
    'patients_rating_of_the_facility_linear_mean_score', 
    'facilities_and_staff_linear_mean_score', 
    'communication_about_your_procedure_linear_mean_score'),]
Out[]:
Correlation with readmission rate:
Out[]:
patients_recommending_the_facility_linear_mean_score   -0.275615
patients_rating_of_the_facility_linear_mean_score      -0.229489
facilities_and_staff_linear_mean_score                 -0.207693
communication_about_your_procedure_linear_mean_score   -0.112479
Name: rate_of_readmission, dtype: float64

I anticipated a higher association between readmission rates and patient reported outcome measures. Perhaps, the two types of measures are not affected my common underlying conditions, which may indicate that the strategies to improve patient satisfaction are distinct from those to decrease readmission rates.

Next, I used plotly to make scatterplots of readmission by each of the three patient reported outcome measures, colored by MUA status.

In []:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig=make_subplots(rows=4,cols=1)

datayes=merged_df[merged_df['is-MUA?']=='Yes']
datano=merged_df[merged_df['is-MUA?']=='No']

dim_list=[['patients_rating_of_the_facility_linear_mean_score',1,1, True],
          ['patients_recommending_the_facility_linear_mean_score',2,1, False],
          ['communication_about_your_procedure_linear_mean_score', 3,1, False],
          ['facilities_and_staff_linear_mean_score', 4,1, False]]

for dim in dim_list:
        fig.append_trace(go.Scatter(
            x=datayes['rate_of_readmission'], 
            y = datayes[dim[0]], 
            mode='markers', 
            name='MUA',
            marker_color='#cea2fd',
            hoverinfo="text", 
            hovertext=datayes.facility_name + ' <br> ' + datayes.city + ' ,' + datayes.state, 
            showlegend=dim[3]), 
            row=dim[1], col=dim[2])
        fig.append_trace(go.Scatter(
            x=datano['rate_of_readmission'], 
            y = datano[dim[0]], 
            mode='markers', 
            name='not MUA',
            marker_color='#137e6d',
            hoverinfo="text", 
            hovertext=datano.facility_name + ' <br> ' + datano.city + ' , ' + datano.state, 
            showlegend=dim[3]), 
            row=dim[1], col=dim[2])


fig.update_layout(template="plotly_white", autosize=False,width=700,height=1000)
fig.update_xaxes(title_text="Readmission Rate", row=4, col=1)
fig.update_yaxes(title_text="Overall Rating", row=1, col=1)
fig.update_yaxes(title_text="Recommendation", row=2, col=1)
fig.update_yaxes(title_text="Communication", row=3, col=1)
fig.update_yaxes(title_text="Facilities and Staff", row=4, col=1)


fig.show()
Out[]:

3. Visualizing the difference between MUA and non-MUA hospitals

Above, I colored the datapoints by MUA status, hoping to get some clues about conditions that could be contributing to outcomes.

The measures that contribute to MUA status are candidates. However, there doesn't seem to be a difference between MUA hospitals and non-MUA hospitals in the association between readmission rate and any of the 4 patient reported outcome measures.

To take a closer look at this, I used boxplots to visualize the difference between MUA and non-MUA hospitals in these four outcome measures.

In []:
fig=make_subplots(rows=2,cols=2)

dim_list=[['patients_rating_of_the_facility_linear_mean_score','#a2bffe',1,1],
          ['patients_recommending_the_facility_linear_mean_score','#0d75f8',1,2],
          ['communication_about_your_procedure_linear_mean_score', '#0e87cc', 2,1], 
          ['rate_of_readmission','#6a79f7', 2,2]]

for dim in dim_list:
    fig.append_trace(go.Box(
        x = merged_df['is-MUA?'], 
        y = merged_df[dim[0]],
        boxpoints='all',
        jitter=0.3, 
        width=.2,
        marker=dict(color = dim[1], line=dict(width=.5,color='DarkSlateGrey')),
        hoverinfo="text", 
        hovertext = merged_df.facility_name + ' <br> ' + merged_df.city + ' , ' + merged_df.state,
        showlegend=False), 
        row=dim[2], col=dim[3])
    

fig.update_layout(title_text="Outcomes Measures in MUA and non-MUA Facilities",template="plotly_white", autosize=False,width=700,height=700)
fig.update_xaxes(title_text="is MUA?")
fig.update_yaxes(title_text="Overall Rating", row=1, col=1)
fig.update_yaxes(title_text="Recommendation", row=1, col=2)
fig.update_yaxes(title_text="Communication", row=2, col=1)
fig.update_yaxes(title_text="Readmission Rate", row=2, col=2)


fig.show()
Out[]:

These boxplots confirm that the distributions of each of these four outcome measures don't differ greatly between MUA and non-MUA hospitals.

Except perhaps that in MUA hospitals, patients rated that they would be less likely to recommend the facility to family and friends.

This visualization doesn't indicate that the measures that contribute to MUA status would be good targets for investment for those trying to improve outcome measures.

However, MUA status is a category that is composed of multiple measures, as mentioned previously. I was curious about looking at each of these measures separately.

4. Disentangling MUA status

I made a scatterplot for each of the measures that contribute to MUA status and readmission rate, responses on recommendation, and responses on overall rating.

In []:
fig=make_subplots(rows=4,cols=3)

dim_list=[['Ratio of Providers per 1000 Population IMU Score',1],
          ['Infant Mortality Rate',2],
          ['Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level', 3],
          ['Percentage of Population Age 65 and Over', 4]]

for dim in dim_list:
        fig.append_trace(go.Scatter(
            x =merged_df['rate_of_readmission'], 
            y = merged_df[dim[0]], 
            mode='markers', 
            marker_color='#40a368',
            marker=dict(color = dim[1], line=dict(width=.5,color='DarkSlateGrey')),
            hoverinfo="text", 
            hovertext = merged_df.facility_name + ' <br> ' + merged_df.city + ' , ' + merged_df.state,
            showlegend=False), 
            row=dim[1], col=1)
        fig.append_trace(go.Scatter(
            x =merged_df['patients_recommending_the_facility_linear_mean_score'], 
            y = merged_df[dim[0]], 
            mode='markers', 
            marker_color='#40a368',
            marker=dict(color = dim[1], line=dict(width=.5,color='DarkSlateGrey')),
            hoverinfo="text", 
            hovertext = merged_df.facility_name + ' <br> ' + merged_df.city + ' , ' + merged_df.state,
            showlegend=False), 
            row=dim[1], col=2)
        fig.append_trace(go.Scatter(
            x =merged_df['patients_rating_of_the_facility_linear_mean_score'], 
            y = merged_df[dim[0]], 
            mode='markers', 
            marker_color='#40a368',
            marker=dict(color = dim[1], line=dict(width=.5,color='DarkSlateGrey')),
            hoverinfo="text", 
            hovertext = merged_df.facility_name + ' <br> ' + merged_df.city + ' , ' + merged_df.state,
            showlegend=False), 
            row=dim[1], col=3)
        
fig.update_layout(template="plotly_white", autosize=False,width=700,height=900)
fig.update_xaxes(title_text="Readmission Rate", row=4, col=1)
fig.update_xaxes(title_text="Recommendation", row=4, col=2)
fig.update_xaxes(title_text="Overall Rating", row=4, col=3)

fig.update_yaxes(title_text="Ratio of Providers per 1000 Pop", row=1, col=1, title_font = dict(size=10))
fig.update_yaxes(title_text='Infant Mortality Rate', row=2, col=1, title_font = dict(size=10))
fig.update_yaxes(title_text="Pop Below US Fed Poverty Level (%)", row=3, col=1, title_font = dict(size=10))
fig.update_yaxes(title_text='Pop => 65 years (%)', row=4, col=1, title_font = dict(size=10))


fig.show()
Out[]:

There don't seem to be any strong correlations here. The correlation coefficients back this up:

In []:
corr = {
    "readmission rate" : merged_df.corr()['rate_of_readmission'], 
    "recommendation" : merged_df.corr()['patients_recommending_the_facility_linear_mean_score'], 
    "overall rating" : merged_df.corr()['patients_rating_of_the_facility_linear_mean_score']}
corr = pd.DataFrame(corr)
corr.loc[(
    'Ratio of Providers per 1000 Population IMU Score',
    'Infant Mortality Rate', 
    'Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level', 
    'Percentage of Population Age 65 and Over'),:]
Out[]:
readmission rate recommendation overall rating
Ratio of Providers per 1000 Population IMU Score -0.084948 -0.005880 0.003425
Infant Mortality Rate 0.189895 -0.085036 -0.028457
Percent of Population with Incomes at or Below 100 Percent of the U.S. Federal Poverty Level -0.015887 0.040402 0.057970
Percentage of Population Age 65 and Over 0.060077 0.002508 -0.064389

5. Ploting results on a map

For fun, I made a plotly map of all the hospitals in this dataset, since it included coordinates. Scroll to zoom in!

In []:
location_df=merged_df[merged_df['location'].notnull()].reset_index()

for i, row in location_df.iterrows():
    location_df.loc[i,'lat']=row['location']['coordinates'][1]
    location_df.loc[i,'lon']=row['location']['coordinates'][0]

location_df['state']=location_df['state'].astype('object')

fig=go.Figure()
datayes=location_df[location_df['is-MUA?']=='Yes']
datano=location_df[location_df['is-MUA?']=='No']
fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = datayes.lon,
        lat = datayes.lat,    
        name='MUA',
        hoverinfo="text",
        text = datayes.facility_name + ' <br> ' + datayes.city + ' , ' + datayes.state,
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.8,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=1,
                color='rgba(102, 102, 102)'
            ),
            colorscale = 'Portland',
            cmin = 0,
            color = location_df['rate_of_readmission'],
            colorbar_title="Rate of Readmission")))
fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = datano.lon,
        lat = datano.lat,
        name='not MUA', 
        hoverinfo="text",
        text = datano.facility_name + ' <br> ' + datano.city + ' , ' + datano.state,
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.8,
            reversescale = True,
            autocolorscale = False,
            symbol = 'square',
            line = dict(
                width=1,
                color='rgba(102, 102, 102)'
            ),
            colorscale = 'Portland',
            cmin = 0,
            color = location_df['rate_of_readmission'],
            colorbar_title="Rate of Readmission")))
fig.update_layout(legend=dict(x=-.1, y=1.2),
        geo = dict(
            scope='usa',
            projection_type='albers usa',
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )


fig.show()
Out[]:

Summary

As mentioned, healthcare outcome measures can sometimes come into conflict with each other, with some interventions having positive effects on some measures but not others. For instance, patient satisfaction is sometimes used as a balance measure for interventions aiming to affect another outcome measure. My exploration showed that while minor, there is a negative association between measures of patient satisfaction and readmission rates (as I expected). This suggests that hospital administrators shouldn't necessarily be concerned that measures to decrease readmission rates would decrease patient satisfaction.

Additionally, I wanted to take a look at the relationship between these factors and MUA status and measures to explore whether investment in any of these areas (say, decreasing infant mortality) could also have positive impact on readmission rates and patient reported outcome measures. While some have minimal association, it appears that the underlying causes of these factors may be different. An important note is that these measures only cover a subset of the ways in which patients are medically underserved. Some areas with higher IMU scores may be underserving certain specific populations (such as those with low-income, or non-English speakers).

In the future, I want to take a closer look at what aspects of healthcare drive patient satisfaction. Gaining insight into how disparate factors influence patient satisfaction and other healthcare statistics would be helpful for administrators decided how to allocate resources

An interesting resource

Return to the top