laloluna921 - Tumblr blog

laloluna921 · 9 months ago

Course

Decision trees

Dataset Overview: The dataset contains 11,170 samples with 666 features.

Decision Tree Model: The model was trained on 6,702 samples and tested on 4,468 samples. Its accuracy is approximately 72.83% (3,254 of 4,468 test samples classified correctly). Assuming the rows of the confusion matrix are actual classes and the columns are predicted classes, this is slightly below the share of the majority class in the test set (3,266 of 4,468, about 73.10%), so the tree adds little beyond always predicting that class.

Confusion Matrix:

[[3246    7    0   13]
 [ 726    2    0    2]
 [  89    0    0    0]
 [ 374    3    0    6]]

The model performs well in predicting the majority class (likely "No diagnosis"), but struggles with minority classes: almost every sample from classes 1, 2, and 3 is assigned to class 0. There's a significant class imbalance, which may affect the model's performance on less common outcomes.

Decision Tree Structure: The tree has a maximum depth of 3, which helps prevent overfitting but may limit its ability to capture complex patterns.

Key decision points:

a. Family income is the top-level split, indicating its importance in predicting alcohol abuse/dependence.

b. Age of onset is the second most important factor, appearing in multiple decision nodes.

c. Personal income also plays a role, but appears less frequently than family income and age of onset.

Key Insights:

For lower family income (<=0.50), age of onset is crucial:

Very early onset (<=16.50) is associated with the highest risk (class 3).

Slightly later onset (16.50-17.50) is associated with a moderate risk (class 1).

Later onset (>17.50) is generally associated with lower risk (class 0).

For higher family income (>0.50), the risk is generally lower (mostly class 0), with some influence from personal income and age of onset.

Limitations and Considerations: The model's accuracy and the tree structure suggest that while these factors (family income, age of onset, personal income) are important, they don't fully explain alcohol abuse/dependence patterns. The class imbalance in the dataset may be affecting the model's performance and should be addressed for more reliable predictions across all classes. Further feature engineering or the use of more advanced models might improve predictive power.
In conclusion, this decision tree model provides valuable insights into the relationships between socioeconomic factors, age of onset, and alcohol abuse/dependence. However, its moderate accuracy suggests that additional factors or more complex modeling techniques might be necessary for a more comprehensive understanding of the issue.
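
As a rough check on the figures in this post, the accuracy, the majority-class share, and the per-class recall can be recomputed from the confusion matrix alone. This is only a sketch: it assumes rows are actual classes and columns are predicted classes, and the names `confusion`, `accuracy`, `majority_share`, and `recall` are mine, not taken from the original analysis.

```python
# Confusion matrix copied from the post.
# Assumption: rows = actual classes, columns = predicted classes.
confusion = [
    [3246, 7, 0, 13],
    [726, 2, 0, 2],
    [89, 0, 0, 0],
    [374, 3, 0, 6],
]

# Total test samples and correctly classified samples (the diagonal).
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Share of the largest actual class: the accuracy of always predicting it.
row_totals = [sum(row) for row in confusion]
majority_share = max(row_totals) / total

# Recall per class: correct predictions divided by actual class size.
recall = [confusion[i][i] / row_totals[i] for i in range(len(confusion))]

print(f"accuracy: {accuracy:.4f}")
print(f"majority-class share: {majority_share:.4f}")
for label, value in enumerate(recall):
    print(f"recall for class {label}: {value:.4f}")
```

Under that assumption, recall is close to 1 for class 0 and close to 0 for the other three classes, which is the pattern the class imbalance would be expected to produce.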

laloluna921 · 10 months ago

Week 4

Based on the results of my analysis, I found that there is a significant association between total family income (S1Q11B) and alcohol abuse/dependence (ALCABDEP12DX). This was determined by performing a Chi-Square test, which yielded a test statistic of 277.33 and a p-value reported as 0.0000 (that is, p < 0.0001). Since the p-value is less than 0.05, I rejected the null hypothesis, which stated that there is no association between the two variables.

I also performed stratified analyses to test for potential moderation effects. In each stratum, I again found a significant association between total family income and alcohol abuse/dependence. The Chi-Square statistics and p-values for each stratum were as follows:

Stratum 1: Chi-Square Statistic = 88.74, P-value = 0.0093

Stratum 2: Chi-Square Statistic = 90.76, P-value = 0.0063

Stratum 3: Chi-Square Statistic = 139.92, P-value < 0.0001

Stratum 4: Chi-Square Statistic = 105.14, P-value = 0.0003

Stratum 5: Chi-Square Statistic = 81.08, P-value = 0.0363

In each case, the p-value was less than 0.05, leading me to reject the null hypothesis and conclude that there is a significant association between total family income and alcohol abuse/dependence within each stratum.

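Because the association is significant in every stratum, these results do not by themselves show that the stratifying variable moderates the relationship between total family income and alcohol abuse/dependence. The Chi-Square statistic does vary across strata (from 81.08 to 139.92), so the strength of the association may differ, but Chi-Square values depend on sample size and cannot be compared directly. Further analysis, such as comparing an effect size across strata, would be needed to confirm whether there is a moderation effect and to understand its nature.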
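
One way to compare the strata on a common scale is an effect size such as Cramér's V, which divides the Chi-Square statistic by the sample size. The sketch below is illustrative only: the post does not report the number of observations per stratum, so `HYPOTHETICAL_N` is a placeholder, and the 21 x 4 table shape is an assumption (it is consistent with 60 degrees of freedom, but is not stated in the post).

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for an n_rows x n_cols contingency table."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Chi-Square statistics reported for the five strata.
stratum_chi2 = {1: 88.74, 2: 90.76, 3: 139.92, 4: 105.14, 5: 81.08}

# HYPOTHETICAL sample size: the per-stratum counts are not given in the post.
# Replace with the real number of observations in each stratum.
HYPOTHETICAL_N = 5000

# Assumed table shape: 21 income categories x 4 diagnosis categories.
for stratum, chi2 in stratum_chi2.items():
    v = cramers_v(chi2, HYPOTHETICAL_N, 21, 4)
    print(f"Stratum {stratum}: Cramér's V = {v:.3f} (with placeholder n)")
```

With the real stratum sizes in place of the placeholder, clearly different values of V across strata would be a better hint of moderation than the p-values alone.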

laloluna921 · 10 months ago

Week 3

Based on the results of my analysis, I found that the Correlation Coefficient between ‘Total Family Income’ and ‘Alcohol Abuse/Dependence’ is -0.01. This indicates a very weak negative linear relationship between these two variables. In other words, as the total family income slightly increases, the alcohol abuse/dependence slightly decreases, and vice versa. However, the relationship is so weak that it’s almost negligible.

Furthermore, the Coefficient of Determination (R^2) is 0.00 after rounding (squaring -0.01 gives 0.0001). This means that essentially none of the variability in ‘Alcohol Abuse/Dependence’ (about 0.01%) can be explained by ‘Total Family Income’. In other words, knowing the total family income does not help me predict the alcohol abuse/dependence.

It’s important to note that correlation does not imply causation. Even though there is a correlation (albeit a very weak one), it does not mean that changes in one variable cause changes in the other. There may be other factors at play influencing both variables.

Also, this analysis assumes that the categories in my variables are ordered and can be interpreted as numerical values. If this is not the case, the correlation coefficient may not be the best statistic to describe the relationship between these variables.
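
To make the two statistics concrete, here is a minimal sketch of how a Pearson correlation and its square are computed. The function `pearson_r` and the small `income_codes` and `diagnosis_codes` lists are hypothetical illustrations, not the NESARC data or the code used for the results above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# HYPOTHETICAL coded values, used only to show the calculation.
income_codes = [1, 2, 3, 4, 5]
diagnosis_codes = [0, 1, 0, 1, 0]

r = pearson_r(income_codes, diagnosis_codes)
print(f"r = {r:.2f}, R^2 = {r ** 2:.4f}")

# The coefficient reported in the post, squared.
reported_r = -0.01
print(f"reported R^2 = {reported_r ** 2:.4f}")
```

Squaring the reported coefficient shows why R^2 appears as 0.00 when rounded to two decimals.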

laloluna921 · 11 months ago

Week 2 Running a Chi-Square Test of Independence

Based on the Chi-Square Test of Independence results I've analyzed:

Chi-Square Statistic: 277.33

P-value: practically 0

Degrees of Freedom: 60

Given that the p-value is far below 0.05, which is the typical threshold for determining statistical significance, I've concluded that there is indeed a statistically significant association between "Total Family Income in the Last 12 Months" (S1Q11B) and "Alcohol Abuse/Dependence in the Last 12 Months" (ALCABDEP12DX). This leads me to reject the null hypothesis, which posited that the two analyzed variables were independent of each other. While the results show a clear association, they do not provide specific details on the strength or direction of this relationship. Nonetheless, it's evident that family income level and alcohol abuse/dependence are related within the dataset I examined.

The Chi-Square statistic is large relative to its 60 degrees of freedom, which is why the p-value is so small. The degrees of freedom equal (rows - 1) x (columns - 1) of the contingency table, so a value of 60 reflects a table with many categories. A large statistic is strong evidence against independence, but it does not by itself measure how strong the association is.

The expected frequencies were particularly telling. They illustrate the distribution one would anticipate if there was no link between the income levels and alcohol abuse/dependence variables. The considerable divergence from these expected frequencies to what was observed—signaled by the high Chi-Square value and the negligible p-value—underscores the presence of an association.
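
The role of the expected frequencies can be sketched in a few lines. The functions `expected_counts` and `chi_square` and the 2 x 2 `observed` table below are hypothetical illustrations, not the actual income by diagnosis table from this analysis.

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square(table):
    """Return the Chi-Square statistic and degrees of freedom for a table."""
    expected = expected_counts(table)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# HYPOTHETICAL 2 x 2 table of observed counts, for illustration only.
observed = [[30, 10], [20, 40]]

expected = expected_counts(observed)
statistic, dof = chi_square(observed)
print("expected counts:", expected)
print(f"chi-square = {statistic:.2f}, degrees of freedom = {dof}")
```

The further the observed counts sit from the expected ones, the larger each cell's contribution to the statistic.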

Given these significant findings, it's clear that a post hoc analysis is warranted to unpack the nuances of how specific income levels correlate with instances of alcohol abuse/dependence.
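
One common way to run that post hoc step is to repeat the Chi-Square test for each pair of income categories and compare each p-value with a Bonferroni-adjusted threshold. The sketch below only computes that threshold. The helper `bonferroni_alpha` is hypothetical, and the figure of 21 income categories is an assumption (it matches 60 degrees of freedom with four diagnosis categories, but is not stated in this post).

```python
from math import comb


def bonferroni_alpha(num_categories, alpha=0.05):
    """Number of pairwise comparisons and the Bonferroni-adjusted threshold."""
    num_comparisons = comb(num_categories, 2)
    return num_comparisons, alpha / num_comparisons


# Assumption: 21 income categories in S1Q11B.
comparisons, adjusted_alpha = bonferroni_alpha(21)
print(f"{comparisons} pairwise comparisons, adjusted alpha = {adjusted_alpha:.6f}")
```

With that many categories the adjusted threshold becomes very strict, so grouping income levels before the pairwise tests may be worth considering.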

Moreover, I recognize it's crucial to consider this association within a broader context. Additional factors, including demographic variables, cultural attitudes towards alcohol consumption, and further socioeconomic indicators, could also play pivotal roles in shaping this observed relationship. Such nuanced analysis would provide a more comprehensive understanding of the dynamics at play.

laloluna921 · 11 months ago

Week 1 Running an analysis of variance

Based on the provided ANOVA results for the different subsets, here's the analysis:

Very high income and very high alcohol abuse/dependence subset:

The ANOVA results show nan (not a number) for both the F-statistic and p-value. This issue typically arises when there is insufficient data or when the groups being compared have zero variance (all values are the same). It's possible that this subset has too few observations or limited variation in the ALCABDEP12DX variable within each income group. In this case, the ANOVA results are inconclusive, and we cannot make any meaningful interpretations.

High income and high alcohol abuse/dependence subset:

The ANOVA results show an F-statistic of 0.36 and a p-value of 0.6959. Since the p-value is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This suggests that there is no significant difference in alcohol abuse/dependence scores among the different high-income groups within this subset.

Middle income and moderate alcohol abuse/dependence subset:

The ANOVA results show nan for both the F-statistic and p-value, similar to the very high income subset. This issue may be caused by insufficient data or lack of variation in the ALCABDEP12DX variable within each income group. As with the very high income subset, the ANOVA results are inconclusive, and no meaningful interpretations can be made.

Low income and moderate alcohol abuse/dependence subset:

The ANOVA results also show nan for both the F-statistic and p-value. The same limitations apply, and no meaningful interpretations can be made for this subset.

Very low income and high alcohol abuse/dependence subset:

The ANOVA results show an F-statistic of 1.29 and a p-value of 0.2779. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This indicates that there is no significant difference in alcohol abuse/dependence scores among the different very low-income groups within this subset.

In summary, for the subsets where the ANOVA results are valid (high_income_high_abuse and very_low_income_high_abuse), we fail to reject the null hypothesis, suggesting no significant difference in alcohol abuse/dependence scores among the different income groups within each subset. However, for the subsets with nan values (very_high_income_very_high_abuse, middle_income_moderate_abuse, and low_income_moderate_abuse), the ANOVA results are inconclusive due to insufficient data or lack of variation within the groups. Further investigation or alternative analysis techniques may be required for these subsets.
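
The nan results can be reproduced with a small hand-written F-statistic. This is a sketch under my own assumptions, not the code that produced the results above: `one_way_f` is a hypothetical helper, and the example groups are made up. It shows that when every value in every group is identical, both the between-group and within-group sums of squares are zero, so F is 0/0 and undefined.

```python
import math


def one_way_f(groups):
    """One-way ANOVA F-statistic; returns nan when F is undefined."""
    groups = [g for g in groups if len(g) > 0]
    k = len(groups)
    n = sum(len(g) for g in groups)
    # At least two groups and more observations than groups are required.
    if k < 2 or n <= k:
        return math.nan
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        # No variation inside any group: 0/0 is undefined, otherwise infinite.
        return math.nan if ss_between == 0 else math.inf
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# HYPOTHETICAL groups: the outcome is constant everywhere, so F is nan.
constant_outcome = [[3, 3, 3], [3, 3], [3, 3, 3, 3]]
print(one_way_f(constant_outcome))

# HYPOTHETICAL groups with variation: F is a regular number.
varied_outcome = [[1, 2, 3], [2, 3, 4]]
print(one_way_f(varied_outcome))
```

If a subset is defined by a single level of the outcome variable, the first case applies, which would be one possible explanation for the nan values.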

laloluna921 · 11 months ago

Week 4

laloluna921 · 1 year ago

Week 3 DA Course

The variables I will use were already categorized, so I just wrote a description of each one and commented my code.

laloluna921 · 1 year ago

First Program with Python

I used the NESARC dataset to create a subset that represents individuals with a total monthly family income of less than 10,000, as well as those who have alcohol dependence or both alcohol abuse and dependence.
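
A subset like this can be sketched with a simple filter. Everything below is illustrative: the `rows` list, the numeric `family_income` field, and the `id` field are hypothetical stand-ins for the real NESARC file, the coding of ALCABDEP12DX (2 = dependence only, 3 = abuse and dependence) is an assumption, and so is the choice to require both conditions at once.

```python
# HYPOTHETICAL records standing in for NESARC rows.
rows = [
    {"id": 1, "family_income": 8500, "ALCABDEP12DX": 2},
    {"id": 2, "family_income": 42000, "ALCABDEP12DX": 3},
    {"id": 3, "family_income": 6000, "ALCABDEP12DX": 0},
    {"id": 4, "family_income": 9900, "ALCABDEP12DX": 3},
]

INCOME_LIMIT = 10_000

# Assumption: 2 = dependence only, 3 = abuse and dependence.
DEPENDENCE_CODES = {2, 3}

# Assumption: both conditions must hold for a row to enter the subset.
subset = [
    row
    for row in rows
    if row["family_income"] < INCOME_LIMIT
    and row["ALCABDEP12DX"] in DEPENDENCE_CODES
]

print([row["id"] for row in subset])
```

If the intended subset is instead the union of the two groups, the `and` in the filter would become `or`.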

laloluna921 · 1 year ago

The Interplay of Socioeconomic Status and Alcohol Consumption: Implications for Life Expectancy

I’ve chosen the Gapminder dataset to study life expectancy in relation to alcohol consumption. This dataset is rich and provides a lot of interesting variables to explore.

This is a topic that has always intrigued me and I believe this dataset provides a great opportunity to explore it further.

CodeBook

Variable Name: Description

alcconsumption: 2008 alcohol consumption per adult (age 15+), litres

lifeexpectancy: 2011 life expectancy at birth (years)

Questions:

Is there a correlation between per capita income (income_per_person) and life expectancy (life_expectancy)?

How does alcohol consumption (alcohol_consumption) vary with per capita income (income_per_person)?

Is there a correlation between the level of education (education_level) and alcohol consumption (alcohol_consumption)?

How does alcohol consumption (alcohol_consumption) affect life expectancy (life_expectancy)?

Is there a difference in alcohol consumption (alcohol_consumption) and life expectancy (life_expectancy) between genders (gender)?

Variables:

Per capita income (income_per_person)

Life expectancy (life_expectancy)

Alcohol consumption (alcohol_consumption)

Level of education (education_level)

Gender (gender)

incomeperperson: This is the Gross Domestic Product per capita in constant 2000 US$

New CodeBook

income_per_person: This variable represents the per capita income for each country. It’s a numerical variable measured in international dollars, fixed 2011 prices.

life_expectancy: This variable indicates the average number of years a newborn child would live if current mortality patterns were to stay the same throughout its life. It’s a numerical variable measured in years.

alcohol_consumption: This variable represents the recorded and estimated average alcohol consumption, adult (15+) per capita consumption in liters of pure alcohol. It’s a numerical variable measured in liters.

education_level: This variable indicates the average years of schooling for adults aged 25 and older. It’s a numerical variable measured in years.

References

Hawkins, B.R., & McCambridge, J. (2023). Association Between Daily Alcohol Intake and Risk of All-Cause Mortality: A Systematic Review and Meta-analyses. JAMA Network Open.

This study found that daily low or moderate alcohol intake was not significantly associated with all-cause mortality risk, while increased risk was evident at higher consumption levels, starting at lower levels for women than men.

Murakami, K., & Hashimoto, H. (2019). Associations of education and income with heavy drinking and problem drinking among men: evidence from a population-based study in Japan. BMC Public Health.

The study revealed that lower educational attainment was significantly associated with increased risks of both non-problematic heavy drinking and problem drinking. Lower income was significantly associated with a lower risk of non-problematic heavy drinking, but not of problem drinking.

Nooyens, A.C.J., Bueno-de-Mesquita, H.B., van Boxtel, M.P.J., van Gelder, B.M., Verhagen, H., & Verschuren, W.M.M. (2020). Alcohol consumption in later life and reaching longevity: the Netherlands Cohort Study. Age and Ageing.

The study found that in women, the total consumption of alcoholic beverages was inversely associated with the decline in global cognitive function over a 5-year period. Red wine consumption was inversely associated with the decline in global cognitive function as well as memory and flexibility.

Rigelsky, M., & Zelenka, V. (2021). Does Alcohol Consumption Affect Life Expectancy in OECD Countries. ResearchGate.

The research concluded that higher income was associated with greater longevity throughout the income distribution. The gap in life expectancy between the richest 1% and poorest 1% of individuals was 14.6 years for men and 10.1 years for women.

Chetty, R., Stepner, M., Abraham, S., Lin, S., Scuderi, B., Turner, N., Bergeron, A., & Cutler, D. (2016). The Association Between Income and Life Expectancy in the United States, 2001-2014. JAMA.

The study found that higher income was associated with greater longevity, and differences in life expectancy across income groups increased over time. Life expectancy for low-income individuals varied substantially across local areas.

The hypothesis below is based on the variables selected from the Gapminder dataset: life expectancy, alcohol consumption, and income per person.

Hypothesis

Socioeconomic status, characterized by factors such as income and education, along with lifestyle choices like alcohol consumption, significantly impacts an individual’s life expectancy and overall health. Specifically, higher income and education levels may be associated with lower risks of heavy and problematic drinking, which in turn could lead to increased longevity. However, the relationship between alcohol consumption and health outcomes might be complex and influenced by factors such as the type and amount of alcohol consumed, and the individual’s overall lifestyle and genetic predisposition.
