Feature Selection on Dirac

Device: Dirac-1

Introduction

In machine learning problems, we often have to start with a large number of features. We need a feature selection technique that can discover a relatively small subset of the most relevant features. In what follows we present a tutorial on using QCi's technology to select a set of features by minimizing their inter-correlation. This approach can be used with any unsupervised machine learning approach such as anomaly detection and clustering algorithms.

Methodology

Let us have a dataset with $N$ $N$ samples and $d$ $d$ features, represented by a $N \times d$ $N \times d$ matrix $F$ $F$ . Moreover, let us each column of $F$ $F$ by $\bf{f}_{j}$ $f_{j}$ where $j \in {1,...,d}$ $j \in 1, ..., d$ , representing all samples for feature $j$ $j$ . We can choose a subset of features of size $d^\prime$ $d^{'}$ ( $d^\prime < d$ $d^{'} < d$ ) such that the inter-correlation of the subset is minimal. We have

\min_{\bf{x}} \sum_{i=1}^{d} \sum_{j=1}^{d} C_{ij} x_{i} x_{j}

min_{x} \sum_{i = 1}^{d} \sum_{j = 1}^{d} C_{ij} x_{i} x_{j}

where

C_{ij} = |corr(\bf{f_{i}}, \bf{f_{j}})|

C_{ij} = ∣ corr (f_{i}, f_{j}) ∣

where $corr$ $corr$ denotes a correlation function such as the Pearson correlation, and $x_{i} \in \{0, 1\}$ $x_{i} \in {0, 1}$ is a binary variable indicating inclusion or exclusion of feature $i$ $i$ . Obviously, $C_{ii} = 1$ $C_{ii} = 1$ . The above minimization problem is subject to a constraint,

\sum_{i=1}^{d} x_{i} = d^\prime

\sum_{i = 1}^{d} x_{i} = d^{'}

We can exclude the diagonal elements of $C$ $C$ as they always add up to $d^\prime$ $d^{'}$ . In the matrix form we have,

\min_{\bf{x}} \bf{{x}^{T}} (C - I) \bf{{x}}

min_{x} x^{T} (C - I) x

subject to the above constraint. Note that $I$ $I$ is a $d \times d$ $d \times d$ identity matrix. Note too that we have assumed that the reduced dimension $d^\prime$ $d^{'}$ is assumed to be given in the above approach.

Medicare Prescription Data

We implemented this approach using a publically available set of data on prescription of opioids in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber

Clean data

We start by cleaning the dataset.

In [1]:

import pandas as pd
# Input                                                                                       
INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
OUT_FILE  = "cleaned_medicare_data.csv"
CON_VARS = [
"total_claim_count",
"total_30_day_fill_count",
"total_drug_cost",
"total_day_supply",
"bene_count",
"total_claim_count_ge65",
"total_30_day_fill_count_ge65",
"total_drug_cost_ge65",
"total_day_supply_ge65",
"bene_count_ge65",
"brand_claim_count",
"brand_drug_cost",
"generic_claim_count",
"generic_drug_cost",
"other_claim_count",
"other_drug_cost",
"mapd_claim_count",
"mapd_drug_cost",
"pdp_claim_count",
"pdp_drug_cost",
"lis_claim_count",
"lis_drug_cost",
"nonlis_claim_count",
"nonlis_drug_cost",
"opioid_claim_count",
"opioid_drug_cost",
"opioid_day_supply",
"opioid_bene_count",
"opioid_prescriber_rate",
"antibiotic_claim_count",
"antibiotic_drug_cost",
"antibiotic_bene_count",
"hrm_claim_count_ge65",
"hrm_drug_cost_ge65",
"hrm_bene_count_ge65",
"antipsych_claim_count_ge65",
"antipsych_drug_cost_ge65",
"antipsych_bene_count_ge65",
"average_age_of_beneficiaries",
"beneficiary_age_less_65_count",
"beneficiary_age_65_74_count",
"beneficiary_age_75_84_count",
"beneficiary_age_greater_84_count",
"beneficiary_female_count",
"beneficiary_male_count",
"beneficiary_race_white_count",
"beneficiary_race_black_count",
"beneficiary_race_asian_pi_count",
"beneficiary_race_hispanic_count",
"beneficiary_race_nat_ind_count",
"beneficiary_race_other_count",
"beneficiary_nondual_count",
"beneficiary_dual_count",
"beneficiary_average_risk_score",
]
VALID_PROVIDER_MI = [
"A",
"M",
"J",
"L",
"R",
"S",
"E",
"D",
"C",
"B",
"K",
"P",
"W",
"H",
"T",
"G",
"F",
"N",
"V",
"I",
"O",
"Y",
"Z",
"U", 
"Q",
"X",
]
VALID_GEN = ["F", "M", "Other", "Unknown"]
VALID_ENTITIES = ["I", "O"]
VALID_DESC_FLAGS = ["S", "T"]
VALID_ENROLLS = ["E", "N", "O"]
# Some utilities                                                                              
def convert_to_float(x):
try:
return float(x)
except:
return None
def convert_to_int(x):
try:
return int(float(x))
except:
return None
# Read data                                                                                   
df = pd.read_csv(INP_FILE, on_bad_lines = "skip", low_memory=False)
# Clean categorical variables                                                                 
df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
)
df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
df["nppes_credentials"] = df["nppes_credentials"].apply(
lambda x: str(x).replace(".", "")
)
cred_hash = {
"MEDICAL DOCTOR": "MD",
"NURSE PRACTITIONER": "NP",
}
df["nppes_credentials"] = df["nppes_credentials"].apply(
lambda x: cred_hash[x] if x in cred_hash else x,
)
df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
lambda x: x if x in VALID_GEN else "Other",
)
df["nppes_entity_code"] = df["nppes_entity_code"].apply(
lambda x: x if x in VALID_ENTITIES else "Unknown",
)
df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")
df["nppes_provider_country"] = df["nppes_provider_country"].apply(
lambda x: "US" if x == "US" else "Other",
)
df["description_flag"] = df["description_flag"].apply(
lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
)
df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
lambda x: x if x in VALID_ENROLLS else "Unknown",
)
# Treat missing beneficiary count as it cannot be zero                                        
df["bene_count"] = df["bene_count"].apply(
convert_to_int
).fillna(-1)
tmp_df = df.groupby(
"specialty_description", as_index=False,
)["bene_count"].mean()
bene_count_hash = dict(
zip(
tmp_df["specialty_description"],
tmp_df["bene_count"],
)
)
df["bene_count"] = df.apply(
lambda x: x["bene_count"] if x[
"bene_count"
] > 0 else bene_count_hash[
x["specialty_description"]
],
axis=1,
)
# Treat continuous variables                                                                  
for item in CON_VARS:
df[item] = df[item].apply(
convert_to_float
).fillna(0.0)
# Filter out invalid states                                                                   
df  = df[
~df["nppes_provider_state"].isin(
["XX", "E", "N", "S"]
)
]
# Output                                                                                      
df.to_csv(OUT_FILE, index=False)

Generate features

We then generate features. The categorical features are encoded using the average value of a few important variables in each category.

In [2]:

import pandas as pd
# Input                                                                                       
INP_FILE = "cleaned_medicare_data.csv"
OUT_FILE  = "medicare_features.csv"
# Set some parameters                                                                         
CAT_VARS = [
"nppes_provider_mi",
#"nppes_credentials", # This is rather messy, so ignoring it.                             
"nppes_provider_gender",
"nppes_entity_code",
"nppes_provider_city",
"nppes_provider_zip5",
#"nppes_provider_country", # Almost all cases are US                                      
"specialty_description",
"medicare_prvdr_enroll_status",
"nppes_provider_state",
]
CON_VARS = [
"total_claim_count",
"total_30_day_fill_count",
"total_drug_cost",
"total_day_supply",
"bene_count",
"total_claim_count_ge65",
"total_30_day_fill_count_ge65",
"total_drug_cost_ge65",
"total_day_supply_ge65",
"bene_count_ge65",
"brand_claim_count",
"brand_drug_cost",
"generic_claim_count",
"generic_drug_cost",
"other_claim_count",
"other_drug_cost",
"mapd_claim_count",
"mapd_drug_cost",
"pdp_claim_count",
"pdp_drug_cost",
"lis_claim_count",
"lis_drug_cost",
"nonlis_claim_count",
"nonlis_drug_cost",
"opioid_claim_count",
"opioid_drug_cost",
"opioid_day_supply",
"opioid_bene_count",
"antibiotic_claim_count",
"antibiotic_drug_cost",
"antibiotic_bene_count",
"hrm_claim_count_ge65",
"hrm_drug_cost_ge65",
"hrm_bene_count_ge65",
"antipsych_claim_count_ge65",
"antipsych_drug_cost_ge65",
"antipsych_bene_count_ge65",
"average_age_of_beneficiaries",
"beneficiary_age_less_65_count",
"beneficiary_age_65_74_count",
"beneficiary_age_75_84_count",
"beneficiary_age_greater_84_count",
"beneficiary_female_count",
"beneficiary_male_count",
"beneficiary_race_white_count",
"beneficiary_race_black_count",
"beneficiary_race_asian_pi_count",
"beneficiary_race_hispanic_count",
"beneficiary_race_nat_ind_count",
"beneficiary_race_other_count",
"beneficiary_nondual_count",
"beneficiary_dual_count",
"beneficiary_average_risk_score",
]
# Read and clean data                                                                         
df = pd.read_csv(INP_FILE, low_memory=False)
# Embed categorical features                                                                  
embedded_cat_features = []
for item in CAT_VARS:
tmp_df = df.groupby(item, as_index=False).agg(
{
"opioid_claim_count": "mean",
"opioid_drug_cost": "mean",
"opioid_day_supply": "mean",
"opioid_bene_count": "mean",
"opioid_prescriber_rate": "mean",
}
).rename(
columns={
"opioid_claim_count": "%s_opioid_claim_count" % item,
"opioid_drug_cost": "%s_opioid_drug_cost" % item,
"opioid_day_supply": "%s_opioid_day_supply" % item,
"opioid_bene_count": "%s_opioid_bene_count" % item,
"opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
}
)
df = df.merge(tmp_df, how="left", on=item)
embedded_cat_features += [
"%s_opioid_claim_count" % item,
"%s_opioid_drug_cost" % item,
"%s_opioid_day_supply" % item,
"%s_opioid_bene_count" % item,
"%s_opioid_prescriber_rate" % item,
]
# Drop unembedded categorical variables and some others                                       
df = df[["npi"] + CON_VARS + embedded_cat_features]
# Write features file                                                                         
df.to_csv(OUT_FILE, index=False)

Feature Selection

Once the features are generated, we can implement the above-mentioned feature selection algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.

In [3]:

# Import libs
import sys
import os
import time
import numpy as np
import pandas as pd
from qci_client import QciClient
# Define some parameters
FEATURES_FILE = "medicare_features.csv"
REDUCED_DIM = 10
# Read features
df = pd.read_csv(FEATURES_FILE, low_memory=False)

We now print the feature names and get the total count of features in the dataset,

In [4]:

feature_names = list(set(df.columns) - {"npi"})
orig_dim = len(feature_names)
print(
"Original dimension is %d; reduced dimension will be %d" % (
orig_dim,
REDUCED_DIM,
)
)

Out [ ]:

Original dimension is 93; reduced dimension will be 10

We should now create the objective matrix $C$ $C$ ,

In [5]:

# Generate the objective matrix
X = np.array(df[feature_names])
C = abs(np.corrcoef(X, rowvar=False))
# Make correlation symmetric to machine precision
C = 0.5 * (C + C.transpose())
objective = C - np.eye(orig_dim)
objective = np.array(objective, dtype=np.float32)

And create the constraint matrix,

In [6]:

# Generate the constraint
cons_lhs = np.ones(shape=(orig_dim), dtype=np.float32)
cons_lhs = cons_lhs
cons_rhs = np.array([-REDUCED_DIM])
constraints = np.hstack([cons_lhs, cons_rhs])
print(constraints.shape)

Out [ ]:

(94,)

We now solve the above quadratic binary problem using QCi's Dirac-1,

In [7]:

token = "your_token"
api_url = "https://api.qci-prod.com"
qci = QciClient(api_token=token, url=api_url)

In [8]:

# Create json objects
objective_json = {
"file_name": "objective_tutorial.json",
"file_config": {
"objective": {"data": objective, "num_variables": orig_dim},
}  
}
constraint_json = {
"file_name": "constraints_tutorial.json",
"file_config": {
"constraints": {
"data": constraints, 
#"num_variables": orig_dim,
#"num_constraints": 1,
}
}
}
# Solve the optimizzation problem
#qci = QciClient()
objective_file_id = qci.upload_file(file=objective_json)["file_id"]
constraint_file_id = qci.upload_file(file=constraint_json)["file_id"]
# Setup job json
job_params = {
"device_type": "dirac-1", 
"alpha": 5.0, 
"num_samples": 20,
}
body = qci.build_job_body(
job_type="sample-constraint", 
job_params=job_params,
constraints_file_id=constraint_file_id, 
objective_file_id=objective_file_id,
job_name=f"tutorial_eqc1",
job_tags=["tutorial_eqc1"],
)
# Run the job
job_response_json = qci.process_job(job_body=body)
print(job_response_json)
results = job_response_json["results"]
energies = results["energies"]
samples = results["solutions"]
is_feasibles = results["feasibilities"]
if True:
print("Energies:", energies)        
# Pick a feasible solution with lowest energy                                             
# The sample solutions are sorted by energy                                               
sol = None
for i, item in enumerate(samples):
sol = item
is_feasible = is_feasibles[i]
if is_feasible:
break
if not is_feasible:
print("Solution is not feasible!")
assert sol is not None, "No feasible solution found!"
assert len(sol) == orig_dim, "Inconsistent solution size!"
assert sum(sol) == REDUCED_DIM, "Solution is not feasible!"

Out [ ]:

2024-05-08 10:28:12 - Dirac allocation balance = 0 s (unmetered)
2024-05-08 10:28:12 - Job submitted: job_id='663bb62cd448b017e54f94bd'
2024-05-08 10:28:12 - QUEUED
2024-05-08 10:28:15 - RUNNING
2024-05-08 10:48:23 - COMPLETED
2024-05-08 10:48:26 - Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '663bb62cd448b017e54f94bd', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_linearly_constrained_binary_optimization': {'constraints_file_id': '663bb62c98263204a3657526', 'objective_file_id': '663bb62b98263204a3657524', 'alpha': 5, 'atol': 1e-10}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T17:28:12.687Z', 'queued_at_rfc3339nano': '2024-05-08T17:28:12.689Z', 'running_at_rfc3339nano': '2024-05-08T17:28:13.046Z', 'completed_at_rfc3339nano': '2024-05-08T17:48:21.532Z'}, 'job_result': {'file_id': '663bbae598263204a3657528', 'device_usage_s': 1128}}, 'status': 'COMPLETED', 'results': {'counts': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'energies': [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111], 'feasibilities': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], 'objective_values': [2.108758625341579, 2.109875851077959, 2.109875851077959, 2.1111518519464876, 2.1151239282917222, 2.1151239282917227, 2.115212611388415, 2.1162998459767546, 2.121476054191589, 2.13726658251835, 2.1418038606643672, 2.148616509046405, 2.1707787389168516, 2.1876401392510156, 2.2009708418045193, 2.2050841469317675, 2.2229102615965526, 2.2229102615965526, 2.227802827401319, 2.227861546212807], 'solutions': [[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]]}}
Energies: [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111]

Finally, we can print the list of selected variables,

In [9]:

selected_vars = []
for i in range(orig_dim):
if sol[i] > 0:
selected_vars.append(feature_names[i])
print(selected_vars)

Out [ ]:

['nppes_provider_city_opioid_drug_cost', 'beneficiary_race_black_count', 'nppes_provider_gender_opioid_drug_cost', 'specialty_description_opioid_drug_cost', 'nppes_provider_mi_opioid_drug_cost', 'nppes_entity_code_opioid_prescriber_rate', 'beneficiary_race_nat_ind_count', 'antipsych_drug_cost_ge65', 'beneficiary_average_risk_score', 'beneficiary_race_asian_pi_count']