data-diaries - Tumblr blog

data-diaries · 4 months ago

Preliminary Statistical Analyses

1. Description of Preliminary Statistical Analyses

The preliminary analysis investigates the relationships between socioeconomic factors and life expectancy using World Bank indicator data.

Descriptive Statistics:

The average life expectancy is 72 years, ranging from 49 to 84 years across countries.

GDP per capita has a broad distribution, with a median value of approximately $10,000.

Health expenditures (% of GDP) range from 2% to 17%.
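Summary figures like the ones above can be produced in one step with pandas. The sketch below is only an illustration: the column names (life_expectancy, gdp_per_capita, health_exp_pct_gdp) and the five rows of values are made up for the example and are not the World Bank data.

```python
import pandas as pd

# Hypothetical toy values, NOT the World Bank figures reported in this post.
df = pd.DataFrame({
    "life_expectancy": [49.0, 63.5, 72.0, 78.4, 84.0],
    "gdp_per_capita": [800.0, 3500.0, 10000.0, 42000.0, 65000.0],
    "health_exp_pct_gdp": [2.0, 4.5, 6.8, 9.1, 17.0],
})

# One row per statistic, one column per variable.
summary = df.agg(["mean", "median", "min", "max"])
print(summary.round(2))
```

For a skewed variable such as GDP per capita, the median is usually the more informative centre, which is presumably why the median is the figure reported for it.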

Bivariate Analyses:

GDP Per Capita and Life Expectancy:

A strong positive correlation was observed (Pearson r = 0.65). The scatter plot is consistent with this: countries with higher GDP per capita tend to have longer life expectancy.

Health Expenditures and Life Expectancy:

A moderate positive correlation was observed (r = 0.45). The scatter plot suggests diminishing returns at higher levels of health expenditure.

Improved Water Access and Life Expectancy:

A strong positive correlation was observed (r = 0.72). Countries with greater access to improved water sources generally exhibit higher life expectancies.
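As a rough illustration of how a Pearson coefficient like the ones above is computed, here is a minimal sketch on synthetic data. The column names and all values are assumptions made up for the example, so the printed r will not match the coefficients reported in this post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 190 made-up "countries", not the World Bank sample.
rng = np.random.default_rng(0)
n = 190
gdp = rng.lognormal(mean=9.0, sigma=1.2, size=n)
life = 40 + 3.5 * np.log(gdp) + rng.normal(0, 4, size=n)
df = pd.DataFrame({"gdp_per_capita": gdp, "life_expectancy": life})

# Series.corr uses the Pearson method by default.
r = df["gdp_per_capita"].corr(df["life_expectancy"])

# The same quantity from its definition: covariance over the product of
# the two sample standard deviations.
x, y = df["gdp_per_capita"], df["life_expectancy"]
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

print(f"Pearson r = {r:.3f}")
```

Pearson r measures linear association only, so a curved pattern such as the diminishing returns noted for health expenditures can be understated by it.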

2. Plots and Graphs

Figure 1: Scatter Plot of GDP Per Capita vs. Life Expectancy

Description: This plot demonstrates a clear positive association between GDP per capita and life expectancy, with wealthier nations exhibiting longer lifespans.

Figure 2: Scatter Plot of Health Expenditures (% of GDP) vs. Life Expectancy

Description: This plot highlights the trend of diminishing returns, where increases in health expenditures beyond a certain point yield smaller gains in life expectancy.

Figure 3: Scatter Plot of Improved Water Access vs. Life Expectancy

Description: This plot shows a consistent positive relationship, indicating that better access to clean water is associated with longer life expectancy.

#DataAnalysis #datascience #WorldBankData #datavisualization


Analyzing Global Trends: The Impact of Health Expenditures and Socioeconomic Factors on Life Expectancy

Methods Section

1. Sample

Population and Selection Criteria: The dataset contains information on 248 countries collected from World Bank indicators for 2012 and 2013. For this analysis, only countries with complete data for the selected variables (health expenditures, GDP per capita, improved water source access, and life expectancy) were included. Countries with a missing value on any of these variables were excluded.

Sample Size: The final sample consists of 190 countries with valid observations for all variables analyzed.

Description of the Sample: The sample includes a diverse mix of low-, middle-, and high-income countries, representing regions across the globe. This diversity provides a broad basis for understanding global trends in health and socioeconomic indicators.

2. Measures

Variables Included:

Response Variable: Life Expectancy (years): x173_2012 (2012) and x173_2013 (2013).

Predictor Variables:

Health Expenditures (% of GDP): x150_2012, x150_2013.

GDP Per Capita (Current US$): x142_2012, x142_2013.

Improved Water Source Access (% of Population): x156_2012, x156_2013.

Variable Management:

Variables were standardized (mean = 0, standard deviation = 1) to ensure consistency in scaling for statistical analysis.

Missing data were handled by excluding incomplete cases for the selected variables.
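The two steps above (dropping incomplete cases, then standardizing) might look like the sketch below. The column names are the 2012 variable codes listed under Measures, but the values are made up for illustration, and the use of the population standard deviation (ddof=0, the convention scikit-learn's StandardScaler follows) is an assumption, since the post does not say which was used.

```python
import pandas as pd

cols = ["x150_2012", "x142_2012", "x156_2012", "x173_2012"]

# Hypothetical toy rows; None marks a missing value.
raw = pd.DataFrame({
    "x150_2012": [5.0, 8.0, None, 10.0, 3.0],
    "x142_2012": [1200.0, 45000.0, 3000.0, None, 700.0],
    "x156_2012": [60.0, 100.0, 85.0, 99.0, 50.0],
    "x173_2012": [58.0, 81.0, 70.0, 79.0, 55.0],
})

# Listwise deletion: keep only rows with a value for every selected variable.
complete = raw.dropna(subset=cols)

# z-score each column: subtract its mean, divide by its standard deviation.
standardized = (complete - complete.mean()) / complete.std(ddof=0)

print(len(raw), "rows before,", len(complete), "rows after dropping incomplete cases")
print(standardized.round(3))
```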

3. Analyses

Statistical Methods:

Descriptive Analysis: Summary statistics and visualizations (scatter plots, box plots) were used to understand data distributions and relationships between variables.

Predictive Modeling: Lasso regression was applied to identify the most significant predictors of life expectancy while handling multicollinearity among predictors.

Data Splitting: The dataset was split into training (60%) and testing (40%) subsets to evaluate the performance of the predictive model.

Cross-Validation: Ten-fold cross-validation was used to tune the regularization parameter (alpha) in Lasso regression, ensuring optimal model performance and generalizability.
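The modeling steps described above (60/40 split, Lasso, ten-fold cross-validation for alpha) can be sketched with scikit-learn as follows. This is not the code behind the post: the data are synthetic, the three columns of X merely stand in for the three predictors, and wrapping the scaler in a pipeline and using LassoCV's default alpha grid are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 190 countries and 3 predictors (made-up values).
rng = np.random.default_rng(42)
n = 190
X = rng.normal(size=(n, 3))
y = 70 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(scale=1.0, size=n)

# 60% training, 40% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Scale the predictors, then let LassoCV choose alpha by 10-fold CV
# on the training set only.
model = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
model.fit(X_train, y_train)

lasso = model.named_steps["lassocv"]
print("chosen alpha:", lasso.alpha_)
print("coefficients:", lasso.coef_)
print("test R^2:", model.score(X_test, y_test))
```

Fitting the scaler inside the pipeline means the test set plays no part in computing the means and standard deviations, which keeps the test score an honest estimate.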


Exploring the Relationship Between Health Spending and Life Expectancy: Insights from World Bank Data

Research Question: How do health expenditures and socioeconomic factors, such as GDP per capita and access to improved water sources, influence life expectancy across countries in 2012 and 2013?

Motivation/Rationale: The relationship between health spending and life expectancy is critical for understanding how nations can enhance population health outcomes. By exploring the combined effects of socioeconomic factors, this study aims to provide a comprehensive perspective on what drives longevity. As someone passionate about data-driven decision-making, I am interested in uncovering actionable insights that policymakers can use to improve global health standards.

Potential Implications: The findings from this analysis could guide countries in allocating resources more effectively to improve public health. Identifying key drivers of life expectancy may also help low- and middle-income countries prioritize interventions. Additionally, this research can contribute to global discussions on sustainable healthcare investments, supporting long-term strategies to reduce health disparities worldwide.


Blog Entry for Assignment: Frequency Distributions and Data Analysis

The Program

Below is the program I used to analyze the dataset. The code imports the dataset, selects relevant columns, and generates frequency distributions for three chosen variables.

import pandas as pd

# Load the dataset
file_path = r'C:\Users\kauanand\Downloads\gapminder.csv'
data = pd.read_csv(file_path)

# Display the first few rows and column names to understand the dataset structure
print(data.head())
print(data.columns)

# Select relevant columns for frequency distributions
selected_columns = ['incomeperperson', 'alcconsumption', 'lifeexpectancy']

# Generate frequency distributions, including missing values
for column in selected_columns:
    print(f"Frequency Distribution for {column}:\n")
    print(data[column].value_counts(dropna=False))
    print("\n")

Output:

       country   incomeperperson  …        employrate  urbanrate
0  Afghanistan                    …  55.7000007629394      24.04
1      Albania  1914.99655094922  …  51.4000015258789      46.72
2      Algeria  2231.99333515006  …              50.5      65.22
3      Andorra  21943.3398976022  …                        88.92
4       Angola  1381.00426770244  …  75.6999969482422       56.7

[5 rows x 16 columns]
Index(['country', 'incomeperperson', 'alcconsumption', 'armedforcesrate',
       'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate',
       'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore',
       'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate'],
      dtype='object')

Frequency Distribution for incomeperperson:

incomeperperson
                    23
6243.57131825833     1
268.259449511417     1
26551.8442381829     1
14778.1639288175     1
                    ..
13577.8798850901     1
20751.8934243568     1
5330.40161203986     1
1860.75389496662     1
320.771889948584     1
Name: count, Length: 191, dtype: int64

Frequency Distribution for alcconsumption:

alcconsumption
         26
.1        2
.34       2
5.92      2
3.39      2
         ..
12.14     1
3.11      1
11.01     1
10.71     1
4.96      1
Name: count, Length: 181, dtype: int64

Frequency Distribution for lifeexpectancy:

lifeexpectancy
          22
73.979     2
72.974     2
81.097     1
62.465     1
          ..
79.915     1
75.956     1
79.839     1
76.142     1
51.384     1
Name: count, Length: 190, dtype: int64

Here’s a breakdown of the results:

1. Frequency Distribution for incomeperperson:

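The column incomeperperson contains continuous values, so nearly every value is unique. The first row of the output has a blank label and a count of 23: that is the number of countries with no income value recorded, not a value of 23. For example:

a blank entry (no value recorded) appears 23 times,

6243.57131825833 appears 1 time,

268.259449511417 appears 1 time,

... (and so on).

The output lists 191 distinct entries, as shown by Length: 191, and that total includes the blank entry.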

2. Frequency Distribution for alcconsumption:

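The column alcconsumption also contains continuous data, with a few values appearing more than once. As with income, the first row has a blank label: its count of 26 is the number of countries with no value recorded. For example:

a blank entry (no value recorded) appears 26 times,

.1 appears 2 times,

.34 appears 2 times,

5.92 appears 2 times,

... (and so on).

The output lists 181 distinct entries, as shown by Length: 181, and that total includes the blank entry.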

3. Frequency Distribution for lifeexpectancy:

The column lifeexpectancy contains life expectancy in years, and most values are unique. The first row again has a blank label, and its count of 22 is the number of countries with no value recorded. For example:

a blank entry (no value recorded) appears 22 times,

73.979 appears 2 times,

72.974 appears 2 times,

81.097 appears 1 time,

... (and so on).

The output lists 190 distinct entries, as shown by Length: 190, and that total includes the blank entry.

Summary of the frequency distributions based on the data for the selected variables:

Income per person (incomeperperson):

The values in this column are continuous and vary widely, with several unique values across different income levels.

Most values appear only once, indicating a diverse range of income per person across the countries in the dataset.

There are some repeated values, but they are relatively few, suggesting a wide spread of income levels.

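Missing data do show up in the frequency distribution: the first row has a blank label and a count of 23, meaning 23 countries have no income value. These blanks appear to be stored as text rather than NaN (value_counts(dropna=False) would otherwise label them NaN), so .isnull().sum() will not count them until the column is converted to numeric values.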

Alcohol consumption (alcconsumption):

Similar to income, alcohol consumption values are mostly continuous, and the column contains various unique values for alcohol consumption levels.

Some values, like 0.1 and 5.92, appear twice, meaning two countries share that consumption level.

The blank entry with a count of 26 shows that 26 countries have no recorded alcohol consumption value.

Life expectancy (lifeexpectancy):

Life expectancy values also vary across the dataset, with many unique values, indicating differences in life expectancy among the countries.

Some life expectancy values, like 73.979 and 72.974, are repeated, which could represent several countries sharing the same life expectancy.

Missing data are present here too: the blank entry has a count of 22. Otherwise, life expectancy values are well populated across the dataset.

In Conclusion:

All three variables contain a wide range of unique values, with a few repetitions, particularly in the cases of alcconsumption and lifeexpectancy, where certain values appear in multiple countries.

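Each frequency distribution begins with a blank entry (23 for incomeperperson, 26 for alcconsumption, 22 for lifeexpectancy), which points to missing data in all three variables. Converting the columns to numeric values and then checking for NaN can confirm these counts.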
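The NaN check mentioned above could be sketched as follows. Since I cannot reproduce the original file here, the example reads a tiny made-up CSV in which blanks are written as a single space; the country labels and values are hypothetical. The idea is to convert each column with pd.to_numeric(errors="coerce"), which turns anything that is not a number into NaN, and then count the NaN values.

```python
import io
import pandas as pd

# Hypothetical toy rows, NOT the real file: blanks are a single space.
csv_text = (
    "country,incomeperperson,alcconsumption,lifeexpectancy\n"
    "A, ,.1,73.979\n"
    "B,1500.5,5.92, \n"
    "C, , ,72.974\n"
    "D,320.75,.34,51.384\n"
)
data = pd.read_csv(io.StringIO(csv_text))
cols = ["incomeperperson", "alcconsumption", "lifeexpectancy"]

# Coerce every column to numbers; non-numeric entries (the blanks) become NaN.
numeric = data[cols].apply(pd.to_numeric, errors="coerce")

# Count missing values per column.
missing = numeric.isnull().sum()
print(missing)
```

After the conversion, value_counts(dropna=False) would list the missing entries under a NaN label instead of a blank one, which makes them much harder to misread.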

#datascience #data analysis #pandas #python