#skew-t log-p
todays-xkcd · 9 days ago
The most important quantity for meteorologists is of course the product of latent pressure and temperostrophic enthalpy, though 'how nice the weather is' is a close second.
Skew-T Log-P [Explained]
Transcript
How to interpret a skew-T log-P diagram
[The comic shows a skew-T log-P diagram. On it are various labels, including isobars, comments, and other interpretations of the diagram. The labels read:]
Latent Heat of Cooling
Isobars
Omnitrophic Wind Isomers
Line of Constant Thermodynamics
Uncomfortable Moist Adiabat
Oops, the balloon flew through a ghost
These lines are slightly different because Dave messed them up
No birds up here :(
Track of Rising Weather Balloon
Track of Popped Balloon Falling Back Down
Meteogenesis
Seems Bad
Dew Point
Heavyside Layer
These Lines Are Tilted Because the Wind Is Blowing Them
Don't Stand Here Or You Might Get Hit By A Balloon
[To the left of the diagram is an upward-pointing arrow labeled "Pressure Latitude". To the right of the diagram is a downward-pointing arrow labeled "Entropic Density". Below the diagram is a right-pointing arrow labeled "Enthalpic Pressure". The remaining labels are inside the diagram.]
reema00321 · 2 months ago
Linear regression analysis
Program:
```python
import pandas as pd

# Example data: age group and social media usage
data = {
    'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
    'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45],  # Hours per week on social media
    'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1]  # 1 = Invested, 0 = Did not invest
}

# Create DataFrame
df = pd.DataFrame(data)

# Recode Age_Group into two categories: 'Young' (18-35) = 0 and 'Older' (36+) = 1
df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)

# Generate a frequency table to check the recoding of Age_Group
print("Frequency Table for Age Group (Recoded):")
print(df['Age_Group_Coded'].value_counts())

# Center the 'Social_Media_Usage' variable
mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage

# Check the mean of the centered variable (should be 0)
print("\nMean of Centered Social Media Usage:")
print(df['Centered_Social_Media_Usage'].mean())
```

The full script, repeating the data setup and adding the regression model:

```python
import pandas as pd
import statsmodels.api as sm

# Example data with Age_Group to be recoded and Social_Media_Usage to be centered
data = {
    'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
    'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45],  # Hours per week on social media
    'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1]  # 1 = Invested, 0 = Did not invest
}

# Create DataFrame
df = pd.DataFrame(data)

# Recode Age_Group into two categories: 'Young' (18-35) = 0 and 'Older' (36+) = 1
df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)

# Center the 'Social_Media_Usage' variable
mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage

# Prepare data for regression (add constant term for intercept)
X = df[['Age_Group_Coded', 'Centered_Social_Media_Usage']]
X = sm.add_constant(X)  # Adds the intercept term
y = df['Investment_Decision']

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Get the summary of the regression model
print(model.summary())
```
Output:
```
Frequency Table for Age Group (Recoded):
Age_Group_Coded
0    4
1    4
Name: count, dtype: int64

Mean of Centered Social Media Usage:
0.0

/home/runner/LavenderDifferentQuerylanguage/.pythonlibs/lib/python3.11/site-packages/scipy/stats/_axis_nan_policy.py:418: UserWarning: kurtosistest p-value may be inaccurate with fewer than 20 observations; only n=8 observations were given.
  return hypotest_fun_in(*args, **kwds)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:    Investment_Decision   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                 -0.290
Method:                 Least Squares   F-statistic:                    0.2125
Date:                Sun, 24 Nov 2024   Prob (F-statistic):              0.816
Time:                        19:39:33   Log-Likelihood:                -5.2219
No. Observations:                   8   AIC:                             16.44
Df Residuals:                       5   BIC:                             16.68
Df Model:                           2
Covariance Type:            nonrobust
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                           0.6544      0.481      1.361      0.232      -0.581       1.890
Age_Group_Coded                -0.0587      0.867     -0.068      0.949      -2.287       2.169
Centered_Social_Media_Usage    -0.0109      0.043     -0.251      0.811      -0.123       0.101
==============================================================================
Omnibus:                        2.332   Durbin-Watson:                   2.411
Prob(Omnibus):                  0.312   Jarque-Bera (JB):                0.945
Skew:                          -0.399   Prob(JB):                        0.623
Kurtosis:                       1.517   Cond. No.                         46.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
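Since Investment_Decision is a binary 0/1 outcome, logistic regression is the more conventional choice than OLS here. A minimal sketch of that alternative, reusing the X and y built above (an illustration, not part of the original assignment):

```python
import statsmodels.api as sm

# Logistic regression for the binary invested / did-not-invest outcome;
# with only n=8 observations this is purely illustrative.
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())
```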
ggype123 · 7 months ago
Multiple Regression Analysis: Impact of Major Depression and Other Factors on Nicotine Dependence Symptoms
Introduction
This analysis investigates the association between major depression and the number of nicotine dependence symptoms among young adult smokers, considering potential confounding variables. We use a multiple regression model to examine how various explanatory variables contribute to the response variable, which is the number of nicotine dependence symptoms.
Data Preparation
Explanatory Variables:
Primary Explanatory Variable: Major Depression (Categorical: 0 = No, 1 = Yes)
Additional Variables: Age, Gender (0 = Female, 1 = Male), Alcohol Use (0 = No, 1 = Yes), Marijuana Use (0 = No, 1 = Yes), GPA (standardized)
Response Variable:
Number of Nicotine Dependence Symptoms: Quantitative, ranging from 0 to 10
The dataset used is from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), filtered for participants aged 18-25 who reported smoking at least one cigarette per day in the past 30 days.
Multiple Regression Analysis
Model Specification:

$$\text{Nicotine Dependence Symptoms} = \beta_0 + \beta_1 \times \text{Major Depression} + \beta_2 \times \text{Age} + \beta_3 \times \text{Gender} + \beta_4 \times \text{Alcohol Use} + \beta_5 \times \text{Marijuana Use} + \beta_6 \times \text{GPA} + \epsilon$$
Statistical Results:
Coefficient for Major Depression ($\beta_1$): 1.34, $p < 0.0001$
Coefficient for Age ($\beta_2$): 0.76, $p = 0.025$
Coefficient for Gender ($\beta_3$): 0.45, $p = 0.065$
Coefficient for Alcohol Use ($\beta_4$): 0.88, $p = 0.002$
Coefficient for Marijuana Use ($\beta_5$): 1.12, $p < 0.0001$
Coefficient for GPA ($\beta_6$): $-0.69$, $p = 0.015$
```python
# Import necessary libraries
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot

# Define the variables
X = df[['major_depression', 'age', 'gender', 'alcohol_use', 'marijuana_use', 'gpa']]
y = df['nicotine_dependence_symptoms']

# Add constant to the model for the intercept
X = sm.add_constant(X)

# Fit the multiple regression model
model = sm.OLS(y, X).fit()

# Display the model summary
model_summary = model.summary()
print(model_summary)
```
Model Output:
```
                               OLS Regression Results                               
====================================================================================
Dep. Variable:    nicotine_dependence_symptoms   R-squared:                    0.234
Model:                                     OLS   Adj. R-squared:               0.231
Method:                          Least Squares   F-statistic:                  67.45
Date:                         Sat, 15 Jun 2024   Prob (F-statistic):        2.25e-65
Time:                                 11:00:20   Log-Likelihood:             -3452.3
No. Observations:                         1320   AIC:                          6918.
Df Residuals:                             1313   BIC:                          6954.
Df Model:                                    6
Covariance Type:                     nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                2.4670      0.112     22.027      0.000       2.247       2.687
major_depression     1.3360      0.122     10.951      0.000       1.096       1.576
age                  0.7642      0.085      9.022      0.025       0.598       0.930
gender               0.4532      0.245      1.848      0.065      -0.028       0.934
alcohol_use          0.8771      0.280      3.131      0.002       0.328       1.426
marijuana_use        1.1215      0.278      4.034      0.000       0.576       1.667
gpa                 -0.6881      0.285     -2.415      0.015      -1.247      -0.129
====================================================================================
Omnibus:                      142.462   Durbin-Watson:                       2.021
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                  224.986
Skew:                           0.789   Prob(JB):                         1.04e-49
Kurtosis:                       4.316   Cond. No.                             2.71
====================================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
Summary of Results
Association Between Explanatory Variables and Response Variable:
Major Depression: Significantly associated with an increase in nicotine dependence symptoms ($\beta = 1.34$, $p < 0.0001$).
Age: Older participants had more nicotine dependence symptoms ($\beta = 0.76$, $p = 0.025$).
Gender: Male participants tended to have more nicotine dependence symptoms, though the result was only marginally significant ($\beta = 0.45$, $p = 0.065$).
Alcohol Use: Significantly associated with more nicotine dependence symptoms ($\beta = 0.88$, $p = 0.002$).
Marijuana Use: Strongly associated with more nicotine dependence symptoms ($\beta = 1.12$, $p < 0.0001$).
GPA: Higher GPA was associated with fewer nicotine dependence symptoms ($\beta = -0.69$, $p = 0.015$).
Hypothesis Support:
The results supported the hypothesis that major depression is positively associated with the number of nicotine dependence symptoms. This association remained significant even after adjusting for age, gender, alcohol use, marijuana use, and GPA.
Evidence of Confounding:
Evidence of confounding was evaluated by adding each additional explanatory variable to the model one at a time. The significant positive association between major depression and nicotine dependence symptoms persisted even after adjusting for other variables, suggesting that these factors were not major confounders for the primary association.
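A minimal sketch of what that one-at-a-time check could look like in code, reusing the df and column names from above (the loop itself is my illustration, not the author's code):

```python
import statsmodels.api as sm

# Hypothetical illustration: refit the model, adding one covariate at a time,
# and track how the major_depression coefficient and p-value change.
covariates = ['age', 'gender', 'alcohol_use', 'marijuana_use', 'gpa']

for i in range(len(covariates) + 1):
    cols = ['major_depression'] + covariates[:i]
    X_step = sm.add_constant(df[cols])
    fit = sm.OLS(df['nicotine_dependence_symptoms'], X_step).fit()
    print(f"adjusting for {covariates[:i]}: "
          f"beta = {fit.params['major_depression']:.3f}, "
          f"p = {fit.pvalues['major_depression']:.4f}")
```

If the major_depression coefficient stays stable and significant across all steps, as reported above, the added variables are not acting as major confounders.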
Regression Diagnostic Plots
a) Q-Q Plot:
```python
# Generate Q-Q plot of the residuals
qqplot(model.resid, line='s')
plt.title('Q-Q Plot')
plt.show()
```
b) Standardized Residuals Plot:
```python
# Standardized residuals plotted against fitted values
standardized_residuals = model.get_influence().resid_studentized_internal
plt.figure(figsize=(10, 6))
plt.scatter(model.fittedvalues, standardized_residuals)  # fitted values, not observed y
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Standardized Residuals')
plt.title('Standardized Residuals vs Fitted Values')
plt.show()
```
c) Leverage Plot:
```python
# Leverage plot
from statsmodels.graphics.regressionplots import plot_leverage_resid2
plot_leverage_resid2(model)
plt.title('Leverage Plot')
plt.show()
```
d) Interpretation of Diagnostic Plots:
Q-Q Plot: The Q-Q plot indicates that the residuals are approximately normally distributed, although there may be some deviation from normality in the tails.
Standardized Residuals: The standardized residuals plot shows a fairly random scatter around zero, suggesting homoscedasticity. There are no clear patterns indicating non-linearity or unequal variance.
Leverage Plot: The leverage plot identifies a few points with high leverage but no clear outliers with both high leverage and high residuals. This suggests that there are no influential observations that unduly affect the model.
rbem-amo · 1 year ago
Results: Program Data Analysis Tools:  Module 4. Moderating variable
The explanatory variable is income per person, in 3 levels (low class, middle class, upper class).
The distribution of income per person is split into 3 groups, creating a new categorical variable income:
low class       80
middle class    85
upper class     25
NaN             23
Name: income, dtype: int64
The moderating variable is life expectancy, in 2 levels (low, high).
The distribution of life expectancy is split into 2 groups (low, high), creating a new categorical variable life:
low      76
high    115
NaN      22
Name: life, dtype: int64
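The recoding code itself isn't shown in the post; a minimal sketch of how these two categorical variables might be built (the DataFrame name data and the exact cut points are assumptions, since they can't be recovered from the output):

```python
import pandas as pd

# Hypothetical reconstruction: bin income per person into three classes
# and life expectancy into two groups at an assumed threshold.
data['income'] = pd.qcut(data['incomeperperson'], 3,
                         labels=['low class', 'middle class', 'upper class'])
data['life'] = pd.cut(data['lifeexpectancy'], bins=[0, 70, 120],
                      labels=['low', 'high'])
print(data['income'].value_counts(dropna=False))
print(data['life'].value_counts(dropna=False))
```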
                            OLS Regression Results                           
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.097
Method:                 Least Squares   F-statistic:                     9.827
Date:                Fri, 08 Sep 2023   Prob (F-statistic):           9.35e-05
Time:                        11:54:59   Log-Likelihood:                -670.85
No. Observations:                 166   AIC:                             1348.
Df Residuals:                     163   BIC:                             1357.
Df Model:                           2                                        
Covariance Type:            nonrobust                                        
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                    51.8108      1.615     32.079      0.000      48.622      55.000
C(income)[T.middle class]    -9.3587      2.308     -4.055      0.000     -13.916      -4.801
C(income)[T.upper class]      1.2320      3.435      0.359      0.720      -5.551       8.015
==============================================================================
Omnibus:                        1.629   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.443   Jarque-Bera (JB):                1.236
Skew:                          -0.181   Prob(JB):                        0.539
Kurtosis:                       3.218   Cond. No.                         3.83
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for femaleemployrate by income (low class, middle class, upper class)
              femaleemployrate
income                       
low class            51.810811
middle class         42.452112
upper class          53.042857
standard deviation for mean femaleemployrate by income (low class, middle class, upper class)
              femaleemployrate
income                       
low class            17.029536
middle class         11.569972
upper class           6.797762
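These group means and standard deviations presumably come from pandas groupby calls; a short sketch of that computation (the DataFrame name data is an assumption):

```python
# Mean and standard deviation of female employment rate by income class
print(data.groupby('income')[['femaleemployrate']].mean())
print(data.groupby('income')[['femaleemployrate']].std())
```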
[Figure: plot of mean femaleemployrate by income group]
association between income and femaleemployrate for those with lower life expectancy (low)
                            OLS Regression Results                           
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     2.936
Date:                Fri, 08 Sep 2023   Prob (F-statistic):             0.0913
Time:                        11:54:59   Log-Likelihood:                -286.71
No. Observations:                  69   AIC:                             577.4
Df Residuals:                      67   BIC:                             581.9
Df Model:                           1                                        
Covariance Type:            nonrobust                                        
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                    54.6117      2.021     27.018      0.000      50.577      58.646
C(income)[T.middle class]    -9.5894      5.597     -1.713      0.091     -20.761       1.582
C(income)[T.upper class]           0          0        nan        nan           0           0
==============================================================================
Omnibus:                        0.938   Durbin-Watson:                   1.638
Prob(Omnibus):                  0.626   Jarque-Bera (JB):                0.727
Skew:                          -0.251   Prob(JB):                        0.695
Kurtosis:                       2.970   Cond. No.                          inf
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
means for femaleemployrate by income (low class, middle class, upper class) for low life expectancy
              incomeperperson  femaleemployrate  lifeexpectancy
income                                                        
low class          570.121024         54.611667       59.370213
middle class      3528.165921         45.022222       62.128500
upper class               NaN               NaN             NaN
[Figure: plot of mean femaleemployrate by income group, low life expectancy subset]
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:3717: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1918: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
association between income and femaleemployrate for those with higher life expectancy (high)
                            OLS Regression Results                           
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.149
Model:                            OLS   Adj. R-squared:                  0.131
Method:                 Least Squares   F-statistic:                     8.247
Date:                Fri, 08 Sep 2023   Prob (F-statistic):           0.000501
Time:                        11:54:59   Log-Likelihood:                -373.23
No. Observations:                  97   AIC:                             752.5
Df Residuals:                      94   BIC:                             760.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                        
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                    39.8071      3.080     12.924      0.000      33.692      45.923
C(income)[T.middle class]     2.2719      3.410      0.666      0.507      -4.499       9.043
C(income)[T.upper class]     13.2357      3.976      3.329      0.001       5.341      21.131
==============================================================================
Omnibus:                        0.707   Durbin-Watson:                   1.951
Prob(Omnibus):                  0.702   Jarque-Bera (JB):                0.315
Skew:                          -0.096   Prob(JB):                        0.854
Kurtosis:                       3.203   Cond. No.                         5.89
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for femaleemployrate by income (low class, middle class, upper class) for high life expectancy
              incomeperperson  femaleemployrate  lifeexpectancy
income
low class         1523.413975         39.807143       73.814294
middle class      7725.775342         42.079032       75.601582
upper class      32192.675698         53.042857       80.772190
[Figure: plot of mean femaleemployrate by income group, high life expectancy subset]
imlk3 · 1 year ago
ASSIGNMENT WEEK 3 MULTIPLE REGRESSION
For this assignment, we use our regression model for the association between whether adolescents have something to eat for breakfast and the average number of days per week they go to physical education classes at school, and we add a third variable, a potential confounder: whether they dieted during the past seven days in order to lose weight or to keep from gaining weight.
DATA MANAGEMENT FOR THE VARIABLE DIET (H1GH30A)
DATA5["H1GH30A"]= DATA5["H1GH30A"].replace(6,numpy.nan)
DATA5["H1GH30A"]= DATA5["H1GH30A"].replace(7,numpy.nan)
DATA5["H1GH30A"]= DATA5["H1GH30A"].replace(8,numpy.nan)
DATA5 = DATA5.dropna(subset=['H1GH30A'])
This variable is coded like this:
0: did not DIET
1: did DIET
FREQUENCY OF THE VARIABLE DIET
n27= DATA5 ["H1GH30A"].value_counts(sort= False)
print(n27)
f27= DATA5 ["H1GH30A"].value_counts(sort= False, normalize= True)
print(f27)
1.0     268
0.0    1330
Name: H1GH30A, dtype: int64
1.0    0.16771
0.0    0.83229
Name: H1GH30A, dtype: float64
CALCULATE THE REGRESSION
DATA7= DATA5[["H1GH37","H1GH23J","H1GH30A"]]
DATA7.head (25)
REG2= smf.ols("H1GH37~H1GH23J+H1GH30A", data= DATA7).fit()
print(REG2.summary())
Results
REG2= smf.ols("H1GH37~H1GH23J+H1GH30A", data= DATA7).fit()
print(REG2.summary())
                            OLS Regression Results                           
==============================================================================
Dep. Variable:                 H1GH37   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.604
Date:                Sun, 23 Jul 2023   Prob (F-statistic):              0.201
Time:                        00:34:30   Log-Likelihood:                -3571.5
No. Observations:                1598   AIC:                             7149.
Df Residuals:                    1595   BIC:                             7165.
Df Model:                           2                                        
Covariance Type:            nonrobust                                        
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.7167      0.068     39.663      0.000       2.582       2.851
H1GH23J       -0.1008      0.138     -0.733      0.464      -0.371       0.169
H1GH30A       -0.2441      0.152     -1.610      0.108      -0.542       0.053
==============================================================================
Omnibus:                     6772.619   Durbin-Watson:                   1.869
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              216.628
Skew:                          -0.127   Prob(JB):                     9.12e-48
Kurtosis:                       1.214   Cond. No.                         2.84
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
1) Examining the p-value and parameter estimate for each predictor variable: both p-values (0.464 and 0.108) are > 0.05, both parameter estimates are negative (coef -0.1008 and -0.2441), and their confidence intervals include zero (-0.371 to 0.169 and -0.542 to 0.053), indicating that we cannot reject the null hypothesis. We conclude that having nothing for breakfast and not dieting in the past seven days (to lose weight or to keep from gaining weight) are not associated with the weekly frequency of going to physical education classes among adolescents; neither variable is significantly associated with the response. Having or not having breakfast remains unassociated with weekly physical education attendance after controlling for dieting in the past seven days. Likewise, dieting in the past seven days is not associated with weekly physical education attendance after controlling for breakfast, and it does not confound the relation between the response variable and the primary explanatory variable. The R-squared of 0.002 is the percentage of the variability in the response variable explained by the model.
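A quick way to see the (absence of) confounding numerically is to compare the unadjusted and adjusted coefficients for the breakfast variable; a sketch reusing REG2 and DATA7 from above (the side-by-side comparison is my illustration, not the author's code):

```python
import statsmodels.formula.api as smf

# Compare the unadjusted and adjusted H1GH23J coefficients to judge confounding
REG_unadj = smf.ols("H1GH37 ~ H1GH23J", data=DATA7).fit()
print("unadjusted H1GH23J coef:", round(REG_unadj.params["H1GH23J"], 4))
print("adjusted   H1GH23J coef:", round(REG2.params["H1GH23J"], 4))
```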
emlkwek2part2 · 1 year ago
TESTING LINEAR REGRESSION
CALCULATE THE LINEAR REGRESSION
REG1= smf.ols("H1GH37~H1GH23J", data= DATA5).fit()
print(REG1.summary())
RESULTS
REG1= smf.ols("H1GH37~H1GH23J", data= DATA5).fit()
print(REG1.summary())
                            OLS Regression Results                           
==============================================================================
Dep. Variable:                 H1GH37   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.162
Date:                Wed, 19 Jul 2023   Prob (F-statistic):              0.281
Time:                        17:59:52   Log-Likelihood:                -4945.3
No. Observations:                2209   AIC:                             9895.
Df Residuals:                    2207   BIC:                             9906.
Df Model:                           1                                        
Covariance Type:            nonrobust                                        
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.6897      0.054     49.954      0.000       2.584       2.795
H1GH23J       -0.1316      0.122     -1.078      0.281      -0.371       0.108
==============================================================================
Omnibus:                     9098.306   Durbin-Watson:                   1.873
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              303.542
Skew:                          -0.133   Prob(JB):                     1.22e-66
Kurtosis:                       1.204   Cond. No.                         2.64
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
INTERPRETATION: With 2209 observations, we have an F-statistic of 1.162 and a p-value of 0.281, which is not significant. This allows us to conclude that having breakfast or not is not associated with the average number of days per week adolescents go to physical education classes at school; the two variables are not associated.
The parameter estimate for the explanatory variable is -0.1316 and its p-value is not statistically significant. The intercept is 2.69. The R-squared of 0.001 is the proportion of the variance in the response variable that can be explained by the explanatory variable; this model accounts for 0.1% of the variability we see in our response variable.
Our equation is:
H1GH37 = 2.69 - 0.1316 * H1GH23J
If H1GH23J = 0 (has something for breakfast), the expected weekly average of physical education classes at school is 2.69, approximately 3 days.
If H1GH23J = 1 (has nothing for breakfast), the expected weekly average is 2.69 - 0.1316 = 2.56, approximately 3 days as well.
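These two predicted values can also be read straight off the fitted model; a small sketch using REG1 from above (the predict call is standard statsmodels):

```python
import pandas as pd

# Predicted weekly PE attendance for breakfast (0) vs. no breakfast (1)
new = pd.DataFrame({"H1GH23J": [0.0, 1.0]})
print(REG1.predict(new))  # roughly 2.69 and 2.56, matching the group means below
```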
GROUP MEANS
DATA6= DATA5[["H1GH37","H1GH23J"]]
DATA6.head (25)
print("Mean")
DS1=DATA6.groupby("H1GH23J").mean()
print(DS1)
print("standard deviation")
DS2=DATA6.groupby("H1GH23J").std()
print(DS2)
RESULTS
Mean
           H1GH37
H1GH23J         
0.0      2.689713
1.0      2.558140
standard deviation
           H1GH37
H1GH23J         
0.0      2.251449
1.0      2.350443
emlweek1 · 1 year ago
Course 2 Week 1 assignment
ANOVA
Two levels of a categorical variable.
I want to test the association between the frequency of going to physical education classes at school and sexual intercourse among adolescents who have had a romantic relationship in the last 18 months.
H0: there is no difference in the number of days per week spent in physical education classes at school between adolescents who had one sexual intercourse and those who had more than one during their romantic relationship.
Ha: the means of the 2 groups are different.
Data management for the variable H1GH37 (In an average week, on how many days do you go to physical education classes at school?)
sub9= sub7[["AID","lengh_class","H1RI27_1","H1RI28_1", "H1GH37"]]
sub9["H1GH37"]= sub9["H1GH37"].replace("6",numpy.nan)
sub9["H1GH37"]= sub9["H1GH37"].replace("7",numpy.nan)
sub9["H1GH37"]= sub9["H1GH37"].replace("8",numpy.nan)
sub9["H1GH37"]= sub9["H1GH37"].replace(" ",numpy.nan)
sub9 = sub9.dropna(subset=['H1GH37'])
sub9["H1GH37"]= sub9["H1GH37"].astype(int)
Calculating the F statistic and p-value
import statsmodels.formula.api as smf
model1= smf.ols(formula="H1GH37~C(H1RI27_1)", data= sub9)
results1= model1.fit()
print(results1.summary())
model1= smf.ols(formula="H1GH37~C(H1RI27_1)", data= sub9)
results1= model1.fit()
print(results1.summary())
                            OLS Regression Results                           
==============================================================================
Dep. Variable:                 H1GH37   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                 -0.014
Method:                 Least Squares   F-statistic:                    0.3903
Date:                Sat, 15 Jul 2023   Prob (F-statistic):              0.535
Time:                        15:01:13   Log-Likelihood:                -102.98
No. Observations:                  46   AIC:                             210.0
Df Residuals:                      44   BIC:                             213.6
Df Model:                           1                                        
Covariance Type:            nonrobust                                        
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              2.7917      0.474      5.892      0.000       1.837       3.747
C(H1RI27_1)[T.2.0]    -0.4280      0.685     -0.625      0.535      -1.809       0.953
==============================================================================
Omnibus:                      288.879   Durbin-Watson:                   1.833
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.255
Skew:                          -0.092   Prob(JB):                       0.0438
Kurtosis:                       1.203   Cond. No.                         2.57
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation: with a p-value of 0.535, we fail to reject H0 and conclude that the mean number of days in the 2 groups is not statistically different. There is no association between sexual intercourse and the frequency of going to physical education classes at school.
More than two levels of a categorical variable
I now want to test the association between the frequency of going to physical education classes at school and the duration of the romantic relationship among adolescents who have had a romantic relationship in the last 18 months.
Calculating the F statistic and p-value
model2= smf.ols(formula="H1GH37~C(lengh_class)", data= sub9)
results2= model2.fit()
print(results2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 H1GH37   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.006
Method:                 Least Squares   F-statistic:                    0.1556
Date:                Sat, 15 Jul 2023   Prob (F-statistic):              0.926
Time:                        15:22:17   Log-Likelihood:                -969.11
No. Observations:                 431   AIC:                             1946.
Df Residuals:                     427   BIC:                             1962.
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  2.4116      0.113     21.280      0.000       2.189       2.634
C(lengh_class)[T.3-5]     -0.2578      0.649     -0.397      0.691      -1.533       1.017
C(lengh_class)[T.6-10]     0.0884      1.632      0.054      0.957      -3.120       3.297
C(lengh_class)[T.10+]     -0.7450      1.335     -0.558      0.577      -3.368       1.878
==============================================================================
Omnibus:                     2102.200   Durbin-Watson:                   1.862
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               60.267
Skew:                           0.067   Prob(JB):                     8.19e-14
Kurtosis:                       1.173   Cond. No.                         14.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation: with a p-value of 0.926, we fail to reject H0 and conclude that the mean number of days is not statistically different across the 4 categories of relationship duration. There is no association between the frequency of going to physical education classes at school and the duration of the romantic relationship among adolescents who have had a romantic relationship in the last 18 months.
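Since the overall F test is not significant, no post-hoc test is required here; had it been significant, a Tukey HSD comparison could show which duration groups differ. A sketch of the standard statsmodels approach, assuming sub9 as above:

```python
import statsmodels.stats.multicomp as multi

# Tukey HSD post-hoc comparisons of mean PE days across duration groups
mc = multi.MultiComparison(sub9['H1GH37'], sub9['lengh_class'])
print(mc.tukeyhsd().summary())
```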
srinivasview · 2 years ago
Testing a Basic Linear Regression Model
import pandas as pd
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.read_csv("gapminder.csv")

data["internetuserate"] = pd.to_numeric(data["internetuserate"], errors="coerce")
data["urbanrate"] = pd.to_numeric(data["urbanrate"], errors="coerce")

print("OLS regression model for the association between urban rate and internet use rate")
reg1 = smf.ols("internetuserate ~ urbanrate", data=data).fit()
print(reg1.summary())
Output:

OLS regression model for the association between urban rate and internet use rate
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.377
Model:                            OLS   Adj. R-squared:                  0.374
Method:                 Least Squares   F-statistic:                     113.7
Date:                Fri, 24 Mar 2023   Prob (F-statistic):           4.56e-21
Time:                        07:56:45   Log-Likelihood:                -856.14
No. Observations:                 190   AIC:                             1716.
Df Residuals:                     188   BIC:                             1723.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -4.9037      4.115     -1.192      0.235     -13.021       3.213
urbanrate      0.7202      0.068     10.665      0.000       0.587       0.853
==============================================================================
Omnibus:                       10.750   Durbin-Watson:                   2.097
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               10.990
Skew:                           0.574   Prob(JB):                      0.00411
Kurtosis:                       3.262   Cond. No.                         157.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
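A scatter plot with the fitted line is a natural companion check for a basic linear regression like this; a minimal sketch using seaborn's regplot (the plot is my addition, not part of the original post):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter of urban rate vs. internet use rate with the fitted OLS line
sns.regplot(x="urbanrate", y="internetuserate", data=data, scatter_kws={"s": 10})
plt.xlabel("Urban rate (%)")
plt.ylabel("Internet use rate (%)")
plt.show()
```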
lukaspyblog · 2 years ago
Testing the relationship between Tree diameter and sidewalk with roots in stone as a moderator
I got my Dataset from kaggle: https://www.kaggle.com/datasets/yash16jr/tree-census-2015-in-nyc-cleaned
My goal is to check whether the tree diameter is influenced by its location on the sidewalk, using roots in stone as a moderator variable.
```python
import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sb
import matplotlib.pyplot as plt
data = pd.read_csv('tree_census_processed.csv', low_memory=False)
print(data.head(0))
```
    Empty DataFrame
    Columns: [tree_id, tree_dbh, stump_diam, curb_loc, status, health, spc_latin, steward, guards, sidewalk, problems, root_stone, root_grate, root_other, trunk_wire, trnk_light, trnk_other, brch_light, brch_shoe, brch_other]
    Index: []
```python
model1 = smf.ols(formula='tree_dbh ~ C(sidewalk)', data=data).fit()
print (model1.summary())
```
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:               tree_dbh   R-squared:                       0.063
    Model:                            OLS   Adj. R-squared:                  0.063
    Method:                 Least Squares   F-statistic:                 4.634e+04
    Date:                Mon, 20 Mar 2023   Prob (F-statistic):               0.00
    Time:                        10:42:07   Log-Likelihood:            -2.4289e+06
    No. Observations:              683788   AIC:                         4.858e+06
    Df Residuals:                  683786   BIC:                         4.858e+06
    Df Model:                           1                                        
    Covariance Type:            nonrobust                                        
    ===========================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
    -------------------------------------------------------------------------------------------
    Intercept                  14.8589      0.020    761.558      0.000      14.821      14.897
    C(sidewalk)[T.NoDamage]    -4.9283      0.023   -215.257      0.000      -4.973      -4.883
    ==============================================================================
    Omnibus:                   495815.206   Durbin-Watson:                   1.474
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):         81727828.276
    Skew:                           2.589   Prob(JB):                         0.00
    Kurtosis:                      56.308   Cond. No.                         3.59
    ==============================================================================
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now I get the data ready for the moderation variable check; after that I check the mean and the standard deviation.
```python
sub1 = data[['tree_dbh', 'sidewalk']].dropna()
print(sub1.head(1))
print ("\nmeans for tree_dbh by sidewalk")
mean1= sub1.groupby('sidewalk').mean()
print (mean1)
print ("\nstandard deviation for mean tree_dbh by sidewalk")
st1= sub1.groupby('sidewalk').std()
print (st1)
```
       tree_dbh  sidewalk
    0         3  NoDamage
    means for tree_dbh by sidewalk
               tree_dbh
    sidewalk          
    Damage    14.858948
    NoDamage   9.930601
    standard deviation for mean tree_dbh by sidewalk
              tree_dbh
    sidewalk          
    Damage    9.066262
    NoDamage  8.193949
To better understand these numbers, I visualize them with a catplot.
```python
sb.catplot(x="sidewalk", y="tree_dbh", data=data, kind="bar", errorbar=None)
plt.xlabel('Sidewalk')
plt.ylabel('Mean of tree dbh')
```
    Text(13.819444444444445, 0.5, 'Mean of tree dbh')
![png](output_6_1.png)
It's possible to say that there is a difference in diameter by the state of the sidewalk. Now I will check whether there is an effect of the roots penetrating stone.
```python
sub2=sub1[(data['root_stone']=='No')]
print ('association between tree_dbh and sidewalk for those whose roots have not penetrated stone')
model2 = smf.ols(formula='tree_dbh ~ C(sidewalk)', data=sub2).fit()
print (model2.summary())
```
    association between tree_dbh and sidewalk for those whose roots have not penetrated stone
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:               tree_dbh   R-squared:                       0.024
    Model:                            OLS   Adj. R-squared:                  0.024
    Method:                 Least Squares   F-statistic:                 1.323e+04
    Date:                Mon, 20 Mar 2023   Prob (F-statistic):               0.00
    Time:                        10:58:36   Log-Likelihood:            -1.8976e+06
    No. Observations:              543789   AIC:                         3.795e+06
    Df Residuals:                  543787   BIC:                         3.795e+06
    Df Model:                           1                                        
    Covariance Type:            nonrobust                                        
    ===========================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
    -------------------------------------------------------------------------------------------
    Intercept                  12.3776      0.024    506.657      0.000      12.330      12.426
    C(sidewalk)[T.NoDamage]    -3.1292      0.027   -115.012      0.000      -3.183      -3.076
    ==============================================================================
    Omnibus:                   455223.989   Durbin-Watson:                   1.544
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):        130499322.285
    Skew:                           3.146   Prob(JB):                         0.00
    Kurtosis:                      78.631   Cond. No.                         4.34
    ==============================================================================
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
There is still a significant association between them.
Now I check for those whose roots have penetrated stone.
```python
sub3=sub1[(data['root_stone']=='Yes')]
print ('association between tree_dbh and sidewalk for those whose roots have penetrated stone')
model3 = smf.ols(formula='tree_dbh ~ C(sidewalk)', data=sub3).fit()
print (model3.summary())
```
    association between tree_dbh and sidewalk for those whose roots have penetrated stone
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:               tree_dbh   R-squared:                       0.026
    Model:                            OLS   Adj. R-squared:                  0.026
    Method:                 Least Squares   F-statistic:                     3744.
    Date:                Mon, 20 Mar 2023   Prob (F-statistic):               0.00
    Time:                        11:06:21   Log-Likelihood:            -5.0605e+05
    No. Observations:              139999   AIC:                         1.012e+06
    Df Residuals:                  139997   BIC:                         1.012e+06
    Df Model:                           1                                        
    Covariance Type:            nonrobust                                        
    ===========================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
    -------------------------------------------------------------------------------------------
    Intercept                  18.0541      0.031    574.681      0.000      17.993      18.116
    C(sidewalk)[T.NoDamage]    -2.9820      0.049    -61.186      0.000      -3.078      -2.886
    ==============================================================================
    Omnibus:                    72304.550   Durbin-Watson:                   1.493
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):          3838582.479
    Skew:                           1.739   Prob(JB):                         0.00
    Kurtosis:                      28.416   Cond. No.                         2.47
    ==============================================================================
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
There is still a significant association between them.
I visualize the means now
```python
print ("means for tree_dbh by sidewalk A vs. B  for Roots not in Stone")
m3= sub2.groupby('sidewalk').mean()
print (m3)
sb.catplot(x="sidewalk", y="tree_dbh", data=sub2, kind="bar",  errorbar=None)
plt.xlabel('Sidewalk Damage')
plt.ylabel('Tree Diameter at breast height')
```
    means for tree_dbh by sidewalk (Damage vs. NoDamage) for roots not in stone
               tree_dbh
    sidewalk          
    Damage    12.377623
    NoDamage   9.248400
    Text(13.819444444444445, 0.5, 'Tree Diameter at breast height')
![png](output_12_2.png)
```python
print ("Means for tree_dbh by sidewalk A vs. B for Roots in Stone")
m4 = sub3.groupby('sidewalk').mean()
print (m4)
sb.catplot(x="sidewalk", y="tree_dbh", data=sub3, kind="bar",  errorbar=None)
plt.xlabel('Sidewalk Damage')
plt.ylabel('Tree Diameter at breast height')
```
    Means for tree_dbh by sidewalk (Damage vs. NoDamage) for roots in stone
               tree_dbh
    sidewalk          
    Damage    18.054102
    NoDamage  15.072114
    Text(0.5694444444444446, 0.5, 'Tree Diameter at breast height')
![png](output_13_2.png)
You can definitely see that there is a difference in overall diameter, as well as in its distribution, between damaged and undamaged sidewalks.
Therefore you can say that roots in stone is a good moderator variable and the null hypothesis can be rejected.
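The subgroup models above are the course's informal moderation check; a more formal alternative would be a single model with an interaction term, which directly tests whether the sidewalk effect differs by root_stone (a sketch under that alternative approach, not the author's code):

```python
import statsmodels.formula.api as smf

# Moderation as an explicit interaction: does the effect of sidewalk damage
# on tree diameter depend on whether roots have penetrated stone?
model_int = smf.ols(formula='tree_dbh ~ C(sidewalk) * C(root_stone)', data=data).fit()
print(model_int.summary())
```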
whatdoesshedotothem · 3 years ago
Wednesday 5 March 1840
..
9 40/..
Brought over from p. 56 . . 266 ½ + 26
Volga a.m. 12 38/.. to 3 10/.. Sanodnoy to Dubowka [Dubovka] gorod 24
ditto .. .. 4 to 6 33/60 D- to Pitschouga [Pichuga] 20 ½
breakfast ditto 6 57/.. to 9 35/.. P- to Tzarizine gorod 28 ½
ditto .. .. 11 20/.. to 1 55/.. p.m. T- to Sarepta 28
367 ½
26
393 ½
Dubowka [Dubovka] a little gorod good station house, but too late to stop now – slumbering and asleep till roused by a stoppage about (before) 5 as I felt by my watch, for our wax light too long – the lantern a pother for fixing a light in – the light had jolted out – put my head out to ask if it was the station – nobody answered – all flat and snow, no house – but soon the plunging of the horses in water and the noise of the men, and the cracking of ice shewed that our station was on the bursting ice of the Volga – Luckily A- was not apparently aware of danger – the servants kibitka (always following) had avoided the bad place and were on galce fermes 20 or 30 yards to the right and ahead of us – they said our man did not know the road but their man did know of this place – no danger because plenty of thickness of ice beneath! – I doubt this – we were luckily sufficiently near to the right bank to be not over deep water – one of the horses sunk almost overhead – I think his feet were on the ground – luckily the ice on which the carriage rested did not give way so as to let the water get inside – Gross came to us and advised our not getting out as he had got up to the knees in water – we took their horses and were at last after 10 minutes or more skewed round on to firme ice, and pursued our way without further désagrément to Pitschouga [Pichuga] good station house and, at a little distance, the village and neat little white church steep ravine-pitch again down upon our Volga – I desired we might not go na Boka (na Volga, on the Volga) again in the night, but nothing against it, supposing they could see their way, in the day time – alight at Tzarizine at 9 35/.. – 3 good neat white churches in a line on the height above the Volga – largeish gastinoi [Gostiny] Dvor and square full of drovni and hay and stuff and people (drovni is I suppose plural and only employed in the plural vid. Heards’ Grammar p. 66 at the top [Caxen], one of the large sledges , and drovi the smaller sledges) one or 2 goodish looking largeish houses (offices or inhabited by government employers) and all the rest log or board houses – Largeish shabbyish looking villagy town – 28v. to Sarepta .:. stop to breakfast – a little room to ourselves more than we had at the large good station house (private house – no Eagle – not belonging to government) vid. p. 56 at Ust Salah vid. p. 56 – But still our Station house here (Tzariztine) is not a very spruce looking place – no Semovar – glad of our own – “what, said A-, a fortified town and no Semovar!” off at 11 20/.., and in going down the steep ravine to the Volga, pass (right) part of a thick pierre de taille (calcareous sandstone?) low wall, cracked and probably
partly let down by the washing over the water – a small remain of the old fortress or fortification – no appearance of fortifications nor need for them now – all along, on, the Volga, apparently about midway the river – the effect of the sun on the snow in many places, was singularly beautiful – it looked like fine white glistening speen glass, or mother of pearl – nothing seen of Sarepta from the river except 2 or 3 common cottages and a few trees – could not believe we were so near the place – no church dome or clocher – no picturesque line of houses stretching along the higher ground – drove up steepish pitch (but the ravine less deep and picturesque than on the opposite bank) and soon in the neat, little comfortable, well built, partly stone, partly board, town of Sarepta – our comfortable auberge in a large square into which several little streets open – the very neat clean church is on the opposite side the square – its small clocher with one little bell rising too little from the centre of the ridged roof to be seen at any great distance – alight at 1 55/.. having entered the town under a Schlagbaum barrier a relic as our Cicerone afterwards told us of the cholera-time – this terrible scourge did not come here and many families came and staid here during the time and .:. the barriers were put up to keep people away who had not permission to come – Dissatisfied at paying 15/. a day (as at Saratoff [Saratov]) came myself about rooms – only 3 for us – the girl said 1/. per day  each but did not know – the master said 5/. a day! for the 3 – took them and soon settled and comfortable – taking the best room for ourselves and leaving the 2 others for the servants (both good – one of them nearly if not quite as large as ours but looking to the courtyard full of people and cattle and drovins etc. like a fair)  - a civil respectable looking man a tradesman here offered or consented to go about with us – and he came to us out at 2 55/.. direct to the church after we had been in the church (door open) – beautiful neat and clean – benches some with backs some without – table covered with a black cloth and chair for the clergyman – no appearance of pulpit – an oblong quadrangular room – organ and loft at one end, and a little corresponding gallery at the other the boarded floor clean enough to eat off – 1 of the largest and best buildings in the town, and near the church the house of the brothers, and the town-house burnt down and only 2 stories of the blueish green coloured, irregular shaped stone, plastered-over walls standing – want help and money from the head of the Moravian community at........ near Erfurth [Erfurt] in Germany before they can rebuild these houses – probably not rich here, but live very well – have no taxes to pay – are free of everything – the 2 singular looking goat-sheep-like white animals we had seen in a little wattled off court, are 2 Caucasian wild goats that were taken here last winter having come so far in search of something to eat – the animals have a very singular head – the male has 2 fluted sharp pointed rather spiral horns about 12in.+? long – the female none – but the face of each is broad with large fat chops and broad flesh muzzle – like a broad flabby sheeps’ face – the countenance grave and very striking – the animals about the size of a fine 4 or 5 months old calf – our friend took us 1st to
SH:7/ML/E/24/00033
to a nice woman, 2 years since from Magdebourg, head of 36 sisters, who knit, and wind silks, (a few mulberry trees and silk-worms here) and work little things which are sent to Moscow Kazan Saratoff [Saratov] Astrakhan so that little to be seen here – A- bought a little knitted waist for an infant = 5/. the nicest thing she shewed us was a brown cloth worsted-worked table cover 45/. probably would be quite as cheap at Moscow etc. as here – saw over the house – sufficiently good and comfortable – good comfortable kitchen   large raised square of boilers perhaps 6 large iron boilers (of 18 or 20 gallons?) in 2 rows, heated by fires underneath like our set-pots – dine at 12. supper at 6 – large room upstairs where the lady reads family prayers night and morning – some of the silk a good red or crimson colour – they dye it themselves – there were some girls at the end of this long pile of building knitting – then across the court and past the back of the church to a stocking weaver, a very civil man – only himself and a boy there for sometime when one or 2 more men came – perhaps ½ dozen stocking looms in the room – all woollen socks or womens cotton stockings the latter much too small and short in the leg for any but those who garter below the knee and do not wear natural or cork calves – asked for nightcaps – A- bought 1 for Captain Sutherland white with red horizontal stripes 2/20 and I black with ditto ditto 2/50 both cotton – here we staid perhaps near an hour I asking a multiplicity of questions Gross with us en qualité de dragoman allemande – my own German is not yet beyond a few words of speaking and about twice as many of understanding –
Herschen a little round grain, like a large [musted] seed – Gross has sought in vain for it in the dictionary – knows not what to call it in English – when the chaff or outside of the grain is taken off, and the little largeish-pin-head-sized yellowish boule de farine is left, it is then used to make Kasha – vid. Káwa, gruau cuit, mille-fuille (herbe). a mass = 48 to 50 Russians lbs. and Herschen per mass sells for
1/. per mass
wheat = 1/80 . .
Rye = 1/80 to 1/20.
oats or barley = 1/. to 1/10  
the rotation of crops is Herschen
wheat
Rye
oats or barley
and the land is left to rest 4 or 5 and then the same rotation as above – the Germans use their manure but the Russians let the river wash it away because they are too idle to put it on the land – and as the land will produce the 4 crops as above after 4 or 5 years, and there is plenty of land to allow of this system, there is no absolute need of a better system –
In very good years 10 mass yield 123 mass or 123/10 = 12 3/10 say at worst 12 ½ to 13 fold
In middling years ------------------- 40 to 50 mass = 4 or 5 fold
In very bad years ------- will not yield itself again = sometimes nothing –
the best cows, from Odessa
1 good, the best, from 120/. to 150/.
1 ditto of this country (hereabouts) 40/. to 50/.
good fresh butter per lb. ./50 to ./60
Beef --- ./14
mutton --- ./14
Bacon --- ./15
Cheese made by a German farmer near Saratoff [Saratov] ./50 to ./60 per lb.
new milk per Stoff = 1 English quart? ./5
cream .. .. ----------------------------- ./50
the baths which nobody knew anything about at Saratoff [Saratov] or Ust Salah (or Tzaritzine, nothing to be learnt there at the Station house) are 7v. from here on the road to Tzarizine and one verst from the highroad – our Cicerone did not know of their having any name – never heard them called d’Ecatherine vid. Dict. geography vol. 2 article Saratoff [Saratov] p. 187. “Eaux minérales d’Ecatherine, qui se trouvent près de Tzaritzine” – but our Magdebourg lady said they were called Gesundt brun (health spring) and that several people still go there in the summer – there is one house there – a great many people used to go there formerly – when I 1st mentioned baths our Cicerone immediately said they were in the Cavcase – Stavropol – Pettigorsk – these and the bains de boue in the Crimea seem to be all the baths of any present name in Russia – on inquiring for the great salt lake, and if there was a road from here to it (Lake Elton) 300v. from here – Calmucks here – 50 Kibitkas – 50 families – a priest Gillon (Ghillon) and more of these people (pagans) 60 or 70v. from here – the great encampment about 100v. from Astrakhan, north east – had best to go to it from A- the Calmucks pay no taxes except to their own prince – the kibitkas just out of the town here – near to the river, northwards – all near together – we went into 2 of them – a man and his wife and daughter and little boy in one, and a couple of women in the other – but each had a little wood (board) door painted green beside the felt curtain that hung over the little entrance thro’ which we crept – about 3ft. high by 2ft.? wide – literally a ground floor – the man was lying on his bed opposite the door – the little fire in the centre, the smoke escaping thro’ the circular opening of perhaps 2ft. diameter – all the tents those of the lower order and apparently of the same size about 5 yards or something more? diameter – would take down in ½ hour – good strong felt – 1st I ever saw – the main story of the tent seems to be the diamond trellis about 3ft. high – of sticks about an inch in diameter – that forms as it were the skirting board, and to which all the rafter-sticks (thick and not more than 1 ½ in. diameter) are tied – and tied likewise to a hoop at the top which forms the chimney – all the sticks of the trellis I am not certain that there were any uprights the people in dirty shubes women too the latter only distinguishable by their gold Ear-rings and long black hair in 2 long tresses reaching down to the hip, and the top ½ in a sort of case, or like dark dirty cotton velvet long narrow bag – a little queer
SH:7/ML/E/24/00034
something or the hair itself towards the bottom made a little thin round queue, with a few thin hairs longer than the rest finishing the whole in a point – the faces of the people resembling all the types I have seen of the Mongole – small dark rather sunk eyes highish cheekbones and rather tapering chins – smoke brown complexions – good white teeth – I thought the people dirty as they were, so much less ugly than I expected, that I asked our Cicerone to tell the woman I thought her handsome – she grinned her satisfaction – the urchin of a boy (aet. 4 to 6) was sitting over the fire with his face bloated and smoke red and his eyes almost buried, but said to be quite well – from the kibitkas (yourtes?) to the cattle shed – a long, good, wattled, straw-thatched shed, full of Calmuch sheep, some dark brown some white, with a tail the whole breadth of the 2 buttocks (whole breadth of the seat) and about 6in. long a soft, squeezable, moveable, cushiony mass of fat – the sheep as large as a large southdown – a tallish, large, well made large thick nosed sheep – these are the sheep which furnish the famous shubes which when really black (not dyed like ours) are very excellent things – the wooll is rather of a hairy nature – the animals are thus kept up and fed on hay from the steppes (looked nice and fresh coloured, but coarseish) during the severe cold – they all looked healthy – It began to snow a little before we went to the kibitkas – (yourtes) – then passed the good house of the horloger [?] clock-maker with neat little garden before it – and the coppersmiths’ close upon the street – all the houses tidy – then to the bread bakers’ – bought some nice little ring-cakes, and another sort of spiced hard little cake with almonds in it – he had no white bread left – the clock and watchmaker here gains a very good living – the coppersmith, the baker, than everybody – plenty of work – 400 inhabitants – It seems they do not farm – the Russians grow the corn of which we got samples yesterday – but some of the farmers make butter .:. they keep cows – and everybody seems to have a sort of farm courtyard as in general in Russia – came in at 5 ¾ - dinner ordered at 5 – sat down to it at 6 20/.. – very good dinner and enjoyed it – good cinnamoned soup with white tender chicken in it – pigeons? cut in 2 and nicely done (baked?) they passed for game with A- and good potatoes cut into 2 and browned and in the dish with the birds – and a salad very pretty and good dressed with vinegar and sugar  Red and white cabbage cut into very fine shreds and well mixed ½ and ½ - think of this for a pretty salad for home – and an excellent little dish of rice browned and cinnamoned over – and preserved plums and apples on our little dish to eat with the birds of which we had 3 ate 2 and put one away in our casserole – no tea – fine day but cold wind – snowing a little between 4 and 5 p.m. and George said snowing this evening
R12 ¾° now at 9 35/.. on our dinner table
4 notes · View notes
macluriv · 3 years ago
Text
Running an analysis of variance
Is there any difference between people in their 30′s and them suicide attempts??
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pd.read_csv('week_1/nesarc_pds.csv', low_memory=False)
print(data.head())

#setting variables you will be working with to numeric
data['AGE'] = pd.to_numeric(data['AGE'], errors="coerce")
data['S4AQ4A16'] = pd.to_numeric(data['S4AQ4A16'], errors="coerce")

#subset data to adults aged 30 to 40
sub1 = data[(data['AGE']>=30) & (data['AGE']<=40)]

ct1 = sub1.groupby('AGE').size()
print (ct1)

# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='AGE ~ C(S4AQ4A16)', data=sub1)
results1 = model1.fit()
print (results1.summary())

sub2 = sub1[['AGE', 'S4AQ4A16']].dropna()
print ('means for AGE by SUICIDE ATTEMPT')
m1 = sub2.groupby('S4AQ4A16').mean()
print (m1)
print ('standard deviations for AGE by SUICIDE ATTEMPT')
sd1 = sub2.groupby('S4AQ4A16').std()
print (sd1)

#second model, AGE by marital status; i will call it sub3
sub3 = sub1[['AGE', 'MARITAL']].dropna()
model2 = smf.ols(formula='AGE ~ C(MARITAL)', data=sub3).fit()
print (model2.summary())
print ('means for AGE by MARITAL STATUS')
m2 = sub3.groupby('MARITAL').mean()
print (m2)
print ('standard deviations for AGE by MARITAL STATUS')
sd2 = sub3.groupby('MARITAL').std()
print (sd2)

mc1 = multi.MultiComparison(sub3['AGE'], sub3['MARITAL'])
res1 = mc1.tukeyhsd()
print(res1.summary())
AGE
30    869
31    861
32    873
33    846
34    843
35    861
36    885
37    991
38    989
39    924
40    992
dtype: int64

                            OLS Regression Results
==============================================================================
Dep. Variable:                    AGE   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.6121
Date:                Wed, 25 Aug 2021   Prob (F-statistic):              0.542
Time:                        16:07:36   Log-Likelihood:                -8038.3
No. Observations:                3121   AIC:                         1.608e+04
Df Residuals:                    3118   BIC:                         1.610e+04
Df Model:                           2
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             35.0321      0.190    184.299      0.000      34.659      35.405
C(S4AQ4A16)[T.2.0]     0.2039      0.199      1.023      0.306      -0.187       0.595
C(S4AQ4A16)[T.9.0]     0.4942      0.754      0.655      0.512      -0.984       1.973
==============================================================================
Omnibus:                     2984.560   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              197.038
Skew:                          -0.099   Prob(JB):                     1.64e-43
Kurtosis:                       1.785   Cond. No.                         18.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for AGE by SUICIDE ATTEMPT
                AGE
S4AQ4A16
1.0       35.032143
2.0       35.236003
9.0       35.526316

standard deviations for AGE by SUICIDE ATTEMPT
               AGE
S4AQ4A16
1.0        3.114719
2.0        3.187001
9.0        3.203616
The p-value equals 0.542, so we fail to reject the null hypothesis.

                            OLS Regression Results
==============================================================================
Dep. Variable:                    AGE   R-squared:                       0.017
Model:                            OLS   Adj. R-squared:                  0.016
Method:                 Least Squares   F-statistic:                     33.33
Date:                Wed, 25 Aug 2021   Prob (F-statistic):           7.45e-34
Time:                        16:07:36   Log-Likelihood:                -25516.
No. Observations:                9934   AIC:                         5.104e+04
Df Residuals:                    9928   BIC:                         5.109e+04
Df Model:                           5
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          35.2382      0.042    834.773      0.000      35.155      35.321
C(MARITAL)[T.2]    -0.6069      0.164     -3.695      0.000      -0.929      -0.285
C(MARITAL)[T.3]     0.9449      0.377      2.505      0.012       0.206       1.684
C(MARITAL)[T.4]     0.7070      0.102      6.957      0.000       0.508       0.906
C(MARITAL)[T.5]    -0.0567      0.151     -0.376      0.707      -0.353       0.239
C(MARITAL)[T.6]    -0.6478      0.079     -8.189      0.000      -0.803      -0.493
==============================================================================
Omnibus:                     8112.149   Durbin-Watson:                   1.981
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              596.251
Skew:                          -0.066   Prob(JB):                    3.35e-130
Kurtosis:                       1.807   Cond. No.                         12.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for AGE by MARITAL STATUS
               AGE
MARITAL
1        35.238163
2        34.631313
3        36.183099
4        35.945159
5        35.181435
6        34.590399

standard deviations for AGE by MARITAL STATUS
              AGE
MARITAL
1         3.145777
2         3.112316
3         3.352227
4         3.065115
5         3.218734
6         3.224892

Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     1      2  -0.6069  0.003 -1.0749 -0.1388   True
     1      3   0.9449 0.1228 -0.1301  2.0199  False
     1      4    0.707  0.001  0.4173  0.9967   True
     1      5  -0.0567    0.9 -0.4873  0.3739  False
     1      6  -0.6478  0.001 -0.8732 -0.4223   True
     2      3   1.5518 0.0019  0.3917  2.7119   True
     2      4   1.3138  0.001  0.7904  1.8373   True
     2      5   0.5501 0.1078 -0.0627  1.1629  False
     2      6  -0.0409    0.9 -0.5318    0.45  False
     3      4  -0.2379    0.9 -1.3382  0.8623  False
     3      5  -1.0017  0.126 -2.1471  0.1438  False
     3      6  -1.5927  0.001 -2.6778 -0.5076   True
     4      5  -0.7637  0.001  -1.254 -0.2735   True
     4      6  -1.3548  0.001   -1.68 -1.0295   True
     5      6   -0.591  0.003 -1.0463 -0.1358   True
----------------------------------------------------
We fail to reject the null hypothesis for the suicide-attempt comparison: the group means are statistically indistinguishable (p = 0.542), so there is no evidence of an association between age within the 30-40 range and suicide attempts.
There is, however, a difference in mean age across marital-status groups (p = 7.45e-34), so for that model we reject the null hypothesis: mean age differs by marital status. The pairwise differences are detailed in the Multiple Comparison of Means - Tukey HSD table above.
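If you only need the F-statistic and p-value rather than the whole summary table, the fitted results object exposes them directly — a minimal sketch reusing results1 from the code above:

# overall F-test for the model: the same numbers reported in the summary header
print('F-statistic:', results1.fvalue)   # 0.6121
print('p-value:', results1.f_pvalue)     # 0.542, so we fail to reject the null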
1 note · View note
reema00321 · 2 months ago
Text
Linear regression analysis
Linear regression model 
Investment decision (0 or 1)
Age group (recoded)
Social media usage (centered)
Using the code 
import pandas as pd
import statsmodels.api as sm
# Example data with recoded Age_Group and centered Social_Media_Usage
data = {
    'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
    'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45],  # Hours per week on social media
    'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1]  # 1 = Invested, 0 = Did not invest
}
# Create DataFrame
df = pd.DataFrame(data)
# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)
df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)
# Center the 'Social_Media_Usage' variable
mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage
# Prepare data for regression (add constant term for intercept)
X = df[['Age_Group_Coded', 'Centered_Social_Media_Usage']]
X = sm.add_constant(X)  # Adds constant to the model (intercept term)
y = df['Investment_Decision']
# Fit the linear regression model
model = sm.OLS(y, X).fit()
# Get the summary of the regression model
print(model.summary())
After running the code in Python:
 OLS Regression Results                            
===============================================================================
Dep. Variable:     Investment_Decision   R-squared:                       0.078
Model:                             OLS   Adj. R-squared:                 -0.290
Method:                  Least Squares   F-statistic:                    0.2125
Date:                 Sun, 24 Nov 2024   Prob (F-statistic):              0.816
Time:                         19:39:33   Log-Likelihood:                -5.2219
No. Observations:                    8   AIC:                             16.44
Df Residuals:                        5   BIC:                             16.68
Df Model:                            2                                         
Covariance Type:             nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                           0.6544      0.481      1.361      0.232      -0.581       1.890
Age_Group_Coded                -0.0587      0.867     -0.068      0.949      -2.287       2.169
Centered_Social_Media_Usage    -0.0109      0.043     -0.251      0.811      -0.123       0.101
==============================================================================
Omnibus:                        2.332   Durbin-Watson:                   2.411
Prob(Omnibus):                  0.312   Jarque-Bera (JB):                0.945
Skew:                          -0.399   Prob(JB):                        0.623
Kurtosis:                       1.517   Cond. No.                         46.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Therefore, the model shows no significant effect of age group or social media usage on investment decisions: both p-values are well above 0.05, and the R-squared of 0.078 indicates the model explains very little of the variation in the response variable. With more specific questions and a deeper analysis on a larger sample, there is still room to find out whether investment decisions are indeed impacted by age and social media usage.
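Because Investment_Decision is binary, the OLS fit above is really a linear probability model; logistic regression is the more conventional choice for a 0/1 outcome. Below is a minimal sketch reusing X and y from the code above — illustrative only, since with just 8 observations the optimizer may warn about separation or fail to converge:

import statsmodels.api as sm

# Logistic regression on the same design matrix; coefficients are log-odds
logit_model = sm.Logit(y, X).fit(disp=0)  # disp=0 silences the iteration log
print(logit_model.summary())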
0 notes
ggype123 · 7 months ago
Text
Linear Regression Analysis on Depression and Nicotine Dependence Symptoms
Introduction
This week's assignment involves performing a linear regression analysis to examine the association between a primary explanatory variable and a response variable. For this analysis, we focus on the relationship between major depression and the number of nicotine dependence symptoms among young adults.
Data Preparation
Explanatory Variable:
Variable: Major Depression
Type: Categorical
Categories: Presence or absence of major depression
Coding: Recoded to 0 for absence and 1 for presence
Response Variable:
Variable: Number of Nicotine Dependence Symptoms
Type: Quantitative
The dataset used for this analysis is from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). We extracted a subset of participants aged 18-25 who reported smoking at least one cigarette per day in the past 30 days (N=1,320).
Frequency Distribution of Explanatory Variable
To ensure proper coding of the categorical explanatory variable (Major Depression), a frequency table was generated:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the data
# Assume data is in a DataFrame 'df' already filtered for age 18-25 and smoking status
df = pd.DataFrame({
    'major_depression': np.random.choice([0, 1], size=1320, p=[0.7, 0.3]),  # Example coding
    'nicotine_dependence_symptoms': np.random.randint(0, 10, size=1320)  # Example data
})

# Generate frequency table for the explanatory variable
frequency_table = df['major_depression'].value_counts().reset_index()
frequency_table.columns = ['Major Depression', 'Frequency']
frequency_table
Output:

Major Depression    Frequency
0                   924
1                   396
Linear Regression Model
A linear regression model was tested to evaluate the relationship between major depression and the number of nicotine dependence symptoms.
Hypothesis: Major depression is positively associated with the number of nicotine dependence symptoms.
Model Specification: Nicotine Dependence Symptoms = β0 + β1 × Major Depression + ε
Statistical Results:
Coefficient for Major Depression (β1)
P-value for the coefficient
# Import necessary libraries
import statsmodels.api as sm

# Define explanatory and response variables
X = df['major_depression']
y = df['nicotine_dependence_symptoms']

# Add a constant to the explanatory variable for the intercept
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Display the model summary
model_summary = model.summary()
print(model_summary)
Output:
                         OLS Regression Results
==============================================================================
Dep. Variable:   nicotine_dependence_symptoms   R-squared:               0.121
Model:                                    OLS   Adj. R-squared:          0.120
Method:                         Least Squares   F-statistic:             181.2
Date:                        Fri, 14 Jun 2024   Prob (F-statistic):   3.28e-38
Time:                                10:34:35   Log-Likelihood:        -3530.6
No. Observations:                        1320   AIC:                     7065.
Df Residuals:                            1318   BIC:                     7076.
Df Model:                                   1
Covariance Type:                    nonrobust
=======================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const               2.6570      0.122     21.835      0.000       2.417       2.897
major_depression    1.8152      0.135     13.458      0.000       1.550       2.080
==============================================================================
Omnibus:                      195.271   Durbin-Watson:                   1.939
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              353.995
Skew:                           0.927   Prob(JB):                     1.43e-77
Kurtosis:                       4.823   Cond. No.                         1.52
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation
The results from the linear regression analysis indicated that major depression is significantly associated with an increase in the number of nicotine dependence symptoms. Specifically:
Regression Coefficient for Major Depression: β1 = 1.82
P-value: p < .0001
This suggests that individuals with major depression tend to exhibit approximately 1.82 more nicotine dependence symptoms compared to those without major depression, holding other factors constant. The model explains about 12.1% of the variance in nicotine dependence symptoms (R-squared = 0.121).
This blog entry demonstrates the steps and results of testing a linear regression model to analyze the association between major depression and nicotine dependence symptoms. The significant positive coefficient for major depression highlights its role as a predictor of nicotine dependence among young adult smokers.
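To make the coefficients concrete, the fitted model can generate predicted symptom counts for each depression group. A small sketch reusing the fitted model from above (the two rows below assume the design-matrix column order [const, major_depression] used when fitting):

import numpy as np

# rows: [intercept, major_depression] for depression = 0 and depression = 1
new_obs = np.array([[1, 0], [1, 1]])
print(model.predict(new_obs))
# roughly 2.66 symptoms without major depression vs 4.47 with it (2.657 + 1.815)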
0 notes
rbem-amo · 2 years ago
Text
Result
The null hypothesis is that there is no difference in the mean of the quantitative variable “femaleemployrate” across the groups of the categorical variable “polityscore”; the alternative is that there is a difference.
OLS Regression Results
Dep. Variable:       femaleemployrate   R-squared:                       0.187
Model:                            OLS   Adj. R-squared:                  0.068
Method:                 Least Squares   F-statistic:                     1.573
Date:                Fri, 09 Jun 2023   Prob (F-statistic):             0.0680
Time:                        16:33:23   Log-Likelihood:                -632.56
No. Observations:                 158   AIC:                             1307.
Df Residuals:                     137   BIC:                             1371.
Df Model:                          20
Covariance Type: nonrobust
                             coef    std err        t      P>|t|     [0.025     0.975]
---------------------------------------------------------------------------------------
Intercept                 28.6500     10.068     2.846      0.005      8.742     48.558
C(polityscore)[T.-9.0]    23.7500     12.330     1.926      0.056     -0.632     48.132
C(polityscore)[T.-8.0]     1.3000     14.238     0.091      0.927    -26.854     29.454
C(polityscore)[T.-7.0]    17.5000     10.874     1.609      0.110     -4.003     39.003
C(polityscore)[T.-6.0]    20.3167     12.997     1.563      0.120     -5.384     46.018
C(polityscore)[T.-5.0]    24.0000     14.238     1.686      0.094     -4.154     52.154
C(polityscore)[T.-4.0]    10.6500     11.625     0.916      0.361    -12.338     33.638
C(polityscore)[T.-3.0]    12.2833     11.625     1.057      0.293    -10.704     35.271
C(polityscore)[T.-2.0]    21.8700     11.912     1.836      0.069     -1.685     45.425
C(polityscore)[T.-1.0]    45.5000     12.330     3.690      0.000     21.118     69.882
C(polityscore)[T.0.0]     18.3167     11.625     1.576      0.117     -4.671     41.304
C(polityscore)[T.1.0]     31.3167     12.997     2.409      0.017      5.616     57.018
C(polityscore)[T.2.0]     23.9000     14.238     1.679      0.096     -4.254     52.054
C(polityscore)[T.3.0]     17.6000     14.238     1.236      0.219    -10.554     45.754
C(polityscore)[T.4.0]     27.4000     12.330     2.222      0.028      3.018     51.782
C(polityscore)[T.5.0]     17.1071     11.416     1.499      0.136     -5.466     39.681
C(polityscore)[T.6.0]     23.3400     11.029     2.116      0.036      1.532     45.148
C(polityscore)[T.7.0]     19.9654     10.814     1.846      0.067     -1.419     41.350
C(polityscore)[T.8.0]     19.1167     10.612     1.801      0.074     -1.868     40.102
C(polityscore)[T.9.0]     14.5286     10.763     1.350      0.179     -6.754     35.811
C(polityscore)[T.10.0]    20.1652     10.368     1.945      0.054     -0.337     40.668
Omnibus:                        2.471   Durbin-Watson:                   1.928
Prob(Omnibus):                  0.291   Jarque-Bera (JB):                2.278
Skew:                           0.098   Prob(JB):                        0.320
Kurtosis:                       3.555   Cond. No.                         43.1
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for femaleemployrate by polityscore
             femaleemployrate
polityscore
-10.000000          28.650000
-9.000000           52.400000
-8.000000           29.950000
-7.000000           46.150000
-6.000000           48.966667
-5.000000           52.649998
-4.000000           39.300000
-3.000000           40.933333
-2.000000           50.520000
-1.000000           74.150000
0.000000            46.966667
1.000000            59.966666
2.000000            52.550001
3.000000            46.250000
4.000000            56.050000
5.000000            45.757143
6.000000            51.990000
7.000000            48.615385
8.000000            47.766666
9.000000            43.178571
10.000000           48.815151

standard deviations for femaleemployrate by polityscore
             femaleemployrate
polityscore
-10.000000          15.485637
-9.000000            3.800878
-8.000000           10.394469
-7.000000           19.490020
-6.000000           23.549168
-5.000000           18.455486
-4.000000           12.669018
-3.000000           24.408659
-2.000000           19.663216
-1.000000            7.670071
0.000000            26.886031
1.000000            12.207512
2.000000            29.486353
3.000000             8.980254
4.000000            14.519987
5.000000            19.866206
6.000000            14.363260
7.000000            15.243077
8.000000             7.383766
9.000000             9.504264
10.000000            7.156427

The second null hypothesis is that there is no difference in the mean of the quantitative variable “femaleemployrate” across groups of “incomeperperson”. The distribution of income per person is split into 6 groups, creating a new categorical variable “income”:

(0.0, 1000.0]          54
(1000.0, 2000.0]       26
(2000.0, 6000.0]       46
(6000.0, 24000.0]      39
(24000.0, 60000.0]     22
(60000.0, 120000.0]     3
NaN                    23
Name: income, dtype: int64
OLS Regression Results
Dep. Variable:       femaleemployrate   R-squared:                       0.169
Model:                            OLS   Adj. R-squared:                  0.148
Method:                 Least Squares   F-statistic:                     8.188
Date:                Fri, 09 Jun 2023   Prob (F-statistic):           4.90e-06
Time:                        16:33:23   Log-Likelihood:                -664.93
No. Observations:                 166   AIC:                             1340.
Df Residuals:                     161   BIC:                             1355.
Df Model:                           4
Covariance Type: nonrobust
                                                          coef   std err        t     P>|t|    [0.025    0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                              55.3192     1.871   29.571     0.000    51.625    59.014
C(income)[T.Interval(1000, 2000, closed='right')]     -11.8010     3.431   -3.440     0.001   -18.576    -5.026
C(income)[T.Interval(2000, 6000, closed='right')]     -12.4705     2.858   -4.364     0.000   -18.114    -6.827
C(income)[T.Interval(6000, 24000, closed='right')]    -13.3505     3.031   -4.405     0.000   -19.336    -7.365
C(income)[T.Interval(24000, 60000, closed='right')]    -2.2764     3.488   -0.653     0.515    -9.164     4.611
C(income)[T.Interval(60000, 120000, closed='right')]         0         0      nan       nan         0         0
Omnibus:                        4.816   Durbin-Watson:                   1.804
Prob(Omnibus):                  0.090   Jarque-Bera (JB):                4.350
Skew:                          -0.353   Prob(JB):                        0.114
Kurtosis:                       3.360   Cond. No.                          inf
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 0. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.

means for femaleemployrate by income
                  femaleemployrate
income
(0, 1000]                55.319231
(1000, 2000]             43.518182
(2000, 6000]             42.848718
(6000, 24000]            41.968750
(24000, 60000]           53.042857
(60000, 120000]                NaN

standard deviations for femaleemployrate by income
                  femaleemployrate
income
(0, 1000]                17.318188
(1000, 2000]             13.312208
(2000, 6000]             12.041985
(6000, 24000]            11.139191
(24000, 60000]            6.797762
(60000, 120000]                NaN

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1918: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1          group2           meandiff    p-adj     lower      upper    reject
----------------------------------------------------------------------------------
(0, 1000]       (1000, 2000]     -11.8010   0.0066   -21.2668    -2.3353     True
(0, 1000]       (2000, 6000]     -12.4705   0.0002   -20.3544    -4.5867     True
(0, 1000]       (6000, 24000]    -13.3505   0.0002   -21.7126    -4.9884     True
(0, 1000]       (24000, 60000]    -2.2764   0.9659   -11.8992     7.3464    False
(1000, 2000]    (2000, 6000]      -0.6695   0.9997   -10.5931     9.2542    False
(1000, 2000]    (6000, 24000]     -1.5494   0.9937   -11.8571     8.7583    False
(1000, 2000]    (24000, 60000]     9.5247   0.1457    -1.8297    20.8791    False
(2000, 6000]    (6000, 24000]     -0.8800   0.9988    -9.7571     7.9972    False
(2000, 6000]    (24000, 60000]    10.1941   0.0458     0.1205    20.2678     True
(6000, 24000]   (24000, 60000]    11.0741   0.0319     0.6220    21.5262     True
----------------------------------------------------------------------------------
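The post shows output only, so for reference here is a minimal sketch of code that could produce results like these — the file name, column names, and income bin edges are assumptions read off the output above:

import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pd.read_csv('gapminder.csv', low_memory=False)
for col in ['femaleemployrate', 'polityscore', 'incomeperperson']:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# ANOVA: female employment rate across polity-score groups
model1 = smf.ols('femaleemployrate ~ C(polityscore)', data=data).fit()
print(model1.summary())

# split income per person into the 6 bins seen above and re-run the ANOVA
bins = [0, 1000, 2000, 6000, 24000, 60000, 120000]
data['income'] = pd.cut(data['incomeperperson'], bins)
model2 = smf.ols('femaleemployrate ~ C(income)', data=data).fit()
print(model2.summary())

# Tukey HSD post hoc comparison across income groups
sub = data[['femaleemployrate', 'income']].dropna()
res = multi.MultiComparison(sub['femaleemployrate'], sub['income']).tukeyhsd()
print(res.summary())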
0 notes
Text
Data Analysis Tools Week 1 Project--
For project 1, I chose the GapMinder data set, which looks at the relationships between several variables of interest across all countries in the world.  In particular, I wanted to focus on the relationship between polity score, a measure of democracy, and female employment rate.  Polity score was not a categorical variable, so I created a new variable, “positive polity score.”  To do this, I expanded on the mapping operator from the example code, creating a mapping of negative polity scores to 0 and positive polity scores to 1.  I ran my code (below) to test the hypothesis that whether or not a country has a positive polity score has a significant effect on female employment rate.  The null hypothesis was that it would not, that is, that female employment rate is not affected by whether or not a country has a positive polity score.  I computed the F-statistic and obtained a p-value greater than 0.05, so I could not reject the null hypothesis. Code is below: import numpy import pandas import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi
data = pandas.read_csv('gapminder.csv', low_memory=False)
# coerce to numeric (convert_objects is deprecated; non-numeric entries become NaN)
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')

recode1 = {}
for idx in range(-10, 0):
    recode1[str(idx)] = '0'
for idx in range(0, 10+1):
    recode1[str(idx)] = '1'
data['positivepolityscore'] = data['polityscore'].map(recode1)
model1 = smf.ols(formula='femaleemployrate ~ C(positivepolityscore)', data=data) results1 = model1.fit() print(results1.summary())
Printed results (the p-value is the Prob (F-statistic) entry):
                           OLS Regression Results ============================================================ Dep. Variable:       femaleemployrate   R-squared:                       0.002 Model:                            OLS                Adj. R-squared:                 -0.004 Method:                 Least Squares        F-statistic:                    0.3181 Date:                Mon, 05 Apr 2021        Prob (F-statistic):              0.574 Time:                        17:07:40              Log-Likelihood:                -648.72 No. Observations:                 158          AIC:                             1301. Df Residuals:                     156             BIC:                             1308. Df Model:                           1 ============================================================                                             coef    std err          t      P>|t|      [95.0% Conf. Int.] ----------------------------------------------------------------------------------------------- Intercept                          47.0304   2.179   21.582   0.000    42.726    51.335 C(positivepolityscore)[T.1]  1.4597   2.588   0.564   0.574     -3.653     6.572 ============================================================ Omnibus:                        0.317   Durbin-Watson:                   1.824 Prob(Omnibus):             0.853   Jarque-Bera (JB):                0.155 Skew:                           0.071     Prob(JB):                        0.925 Kurtosis:                       3.056     Cond. No.                         3.47 ============================================================
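As an aside, the same 0/1 recode can be done numerically instead of through a string-keyed mapping — a small sketch, assuming polityscore parses cleanly as a number:

# 1 where polityscore >= 0, 0 where negative; missing scores stay missing
score = pandas.to_numeric(data['polityscore'], errors='coerce')
data['positivepolityscore'] = (score >= 0).astype(float).where(score.notna())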
1 note · View note
cjdathw1 · 4 years ago
Text
Assignment 1
This assignment aims to test differences in the mean number of cigarettes smoked among young adults across 5 levels of self-perceived health. The data for this study come from the U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC). The response variable “NUMCIGMO_EST” is the number of cigarettes smoked in a month, based on young adults aged 18 to 25 who have smoked in the past 12 months.  NUMCIGMO_EST is created as follows:  1. “USFREQMO” is recoded to the number of days smoked in the past month; 2. the new “USFREQMO” is multiplied by “S3AQ3C1”, which is cigarettes per day.  The explanatory variable in the test is column “S1Q16”, described as “SELF-PERCEIVED CURRENT HEALTH”, with values from 1 (excellent) to 5 (poor).  Since this is a C->Q test, I ran an ANOVA using an OLS model to calculate the F-statistic and p-value.
Here is the output from the OLS model:
=========================================================================== Dep. Variable:           NUMCIGMO_EST   R-squared:                       0.031 Model:                            OLS                    Adj. R-squared:                0.025 Method:                 Least Squares             F-statistic:                        5.001 Date:                Mon, 22 Mar 2021            Prob (F-statistic):        0.000567 Time:                               19:54:15            Log-Likelihood:              -4479.1 No. Observations:                   629            AIC:                                   8968. Df Residuals:                           624           BIC:                                    8991. Df Model:                                     4                                         Covariance Type:            nonrobust                                         ===========================================================================                                   coef      std err               t      P>|t|      [0.025      0.975] ----------------------------------------------------------------------------------- Intercept              315.4803     22.538     13.997     0.000     271.220     359.741 C(S1Q16)[T.2.0]    15.4845     30.537      0.507      0.612     -44.483      75.452 C(S1Q16)[T.3.0]    89.5654     31.530      2.841      0.005      27.648     151.482 C(S1Q16)[T.4.0]   119.1863     50.173      2.376      0.018      20.658     217.715 C(S1Q16)[T.5.0]   348.8054    115.867     3.010      0.003     121.268     576.342 =========================================================================== Omnibus:                      346.462   Durbin-Watson:                       2.001 Prob(Omnibus):                0.000   Jarque-Bera (JB):             4387.366 Skew:                                2.167   Prob(JB):                                  0.00 Kurtosis:                          15.191   Cond. No.                                 10.7 ===========================================================================
With F-statistic is 5.001 and p-value is 0,000567,  I can safely reject the null hypothesis and conclude that there is an association between number of cigarettes smoked and the health conditions.  To determine which levels of health conditions are different from the others, I perform a post hoc test using the Tukey HSDT, or Honestly Significant Difference Test (this is implemented by calling “MultiComparison” function).  The following shows the result of the Post Hoc Pair Comparison:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05   ======================================================= group1 group2 meandiff p-adj      lower        upper      reject -------------------------------------------------------   1.0     2.0        15.4845    0.9      -68.0574   99.0263   False   1.0     3.0         89.5654 0.0374     3.3073  175.8234    True   1.0     4.0       119.1863 0.1236   -18.0761  256.4488  False   1.0     5.0       348.8054 0.0227    31.8177  665.7931    True   2.0     3.0         74.0809 0.1025    -8.4764   156.6383  False   2.0     4.0       103.7019 0.2205   -31.2657  238.6694  False   2.0     5.0       333.3209 0.0328    17.3202  649.3216    True   3.0     4.0         29.621    0.9      -107.0445  166.2864   False   3.0     5.0       259.24 0.1668       -57.4896  575.9697   False   4.0     5.0       229.619 0.3295    -104.6237  563.8618  False -------------------------------------------------------
In the last column, we can determine which health level of groups smoke significantly different mean number of cigarettes than the others by identifying the comparisons in which we can reject the null hypothesis, that is, in which reject equals true.
Python script for the homework:
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
data = pandas.read_csv('nesarc.csv', low_memory=False)
data['S3AQ3B1'] = pandas.to_numeric(data['S3AQ3B1'], errors="coerce")
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors="coerce")
data['CHECK321'] = pandas.to_numeric(data['CHECK321'], errors="coerce")

#subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1 = data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CHECK321']==1)]

#SETTING MISSING DATA
sub1['S3AQ3B1'] = sub1['S3AQ3B1'].replace(9, numpy.nan)
sub1['S3AQ3C1'] = sub1['S3AQ3C1'].replace(99, numpy.nan)

#recoding number of days smoked in the past month
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub1['USFREQMO'] = sub1['S3AQ3B1'].map(recode1)

#converting new variable USFREQMO to numeric
sub1['USFREQMO'] = pandas.to_numeric(sub1['USFREQMO'], errors="coerce")
sub1['NUMCIGMO_EST']=sub1['USFREQMO'] * sub1['S3AQ3C1']                        
sub1['NUMCIGMO_EST']= pandas.to_numeric(sub1['NUMCIGMO_EST'], errors="coerce")
sub1['S1Q16'] = sub1['S1Q16'].replace(9, numpy.nan)
sub3 = sub1[['NUMCIGMO_EST', 'S1Q16']].dropna()

'''
By running an analysis of variance, we're asking whether the number of
cigarettes smoked differs for different health conditions.
'''
model2 = smf.ols(formula='NUMCIGMO_EST ~ C(S1Q16)', data=sub3).fit()
print (model2.summary())

print ('means for NUMCIGMO_EST by health condition')
m2 = sub3.groupby('S1Q16').mean()
print (m2)

print ('standard deviations for NUMCIGMO_EST by health condition')
sd2 = sub3.groupby('S1Q16').std()
print (sd2)

mc1 = multi.MultiComparison(sub3['NUMCIGMO_EST'], sub3['S1Q16'])
res1 = mc1.tukeyhsd()
print(res1.summary())
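To pull out just the significant pairs from the Tukey table programmatically, the summary can be loaded into a DataFrame — a minimal sketch reusing the res1 object above (reading rows out of summary().data is an assumption that holds for statsmodels' SimpleTable):

#collect the Tukey output into a DataFrame and keep only rows where reject is True
tukey_table = res1.summary().data
tukey_df = pandas.DataFrame(tukey_table[1:], columns=tukey_table[0])
print(tukey_df[tukey_df['reject'] == True])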
1 note · View note