EXAMPLE USING ANOVA
/* Example Using ANOVA */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
PROC SORT; BY S2AQ6B;
PROC ANOVA; CLASS REGION;
MODEL AGE=REGION;
MEANS REGION; BY S2AQ6B;
RUN;
The ANOVA Procedure (BY group S2AQ6B=.)

Class Level Information
Class   Levels  Values
REGION  3       2 3 4

Number of Observations Read: 5
Number of Observations Used: 5

Dependent Variable: AGE

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2     2400.300000  1200.150000    13.76  0.0678
Error             2      174.500000    87.250000
Corrected Total   4     2574.800000

R-Square  Coeff Var  Root MSE  AGE Mean
0.932228  25.10960   9.340771  37.20000

Source  DF  Anova SS     Mean Square  F Value  Pr > F
REGION  2   2400.300000  1200.150000  13.76    0.0678
Because the p-value (.0678) is above the .05 threshold, there is no significant moderation among age, region, and wine drinking.
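The same kind of one-way test can be sketched in Python with scipy; the group values below are made up for illustration and are not the NESARC data:

```python
# Hypothetical ages grouped by region (codes 2, 3, 4), mirroring the
# SAS MODEL AGE=REGION step above; the values are illustrative only.
from scipy import stats

region2 = [25, 28, 31]
region3 = [45, 52, 49]
region4 = [33, 37]

f_stat, p_value = stats.f_oneway(region2, region3, region4)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# As in the SAS output, a p-value above .05 would mean the region
# means are not significantly different at the usual threshold.
```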
0 notes
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
IF S1Q6A eq 1 THEN education=.;
ELSE IF S1Q6A LE 3 THEN education=1;
ELSE IF S1Q6A LE 5 THEN education=2;
ELSE IF S1Q6A LE 7 THEN education=3;
ELSE IF S1Q6A GT 10 THEN education=3;
PROC SORT; by IDNUM;
PROC CORR; VAR S1Q10B S1Q11B;
RUN;
The CORR Procedure

2 Variables: S1Q10B S1Q11B

Simple Statistics
Variable  N      Mean     Std Dev  Sum     Minimum  Maximum
S1Q10B    43093  6.75750  4.40665  291201  0        17.00000
S1Q11B    43093  9.42141  4.84308  405997  1.00000  21.00000

Pearson Correlation Coefficients, N = 43093
Prob > |r| under H0: Rho=0
        S1Q10B   S1Q11B
S1Q10B  1.00000  0.67157
                 <.0001
S1Q11B  0.67157  1.00000
        <.0001
Individual income (S1Q10B) and household income (S1Q11B) are positively correlated (r = 0.67157), and because the p-value is less than .0001, the association is statistically significant.
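The PROC CORR step above can be sketched in Python with scipy; the two arrays below are made-up stand-ins for S1Q10B and S1Q11B:

```python
# Pearson correlation between two income-like variables, mirroring
# PROC CORR above; the arrays are illustrative, not the NESARC data.
import numpy as np
from scipy import stats

personal_income = np.array([3, 5, 7, 9, 11, 13, 15])
household_income = np.array([4, 6, 9, 10, 14, 15, 20])

r, p = stats.pearsonr(personal_income, household_income)
print(f"r = {r:.5f}, p = {p:.4f}")

# A positive r with a p-value below .05 indicates a significant
# positive linear association, as in the SAS output.
```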
0 notes
Chi-Square
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL TAB12MDX="Tobacco Dependence Past 12 Months"
CHECK321="Smoked Cigarettes in Past 12 Months"
S3AQ3B1="Usual Smoking Frequency"
S3AQ3C1="Usual Smoking Quantity"
S1Q6A="Highest Grade or year of school completed"
S2AQ8A="How often drank any alcohol in the last 12 months";
/* Set appropriate missing data as needed */
IF S3AQ3B1=9 THEN S3AQ3B1=.;
IF S3AQ3C1=99 THEN S3AQ3C1=.;
IF S2AQ8A=99 THEN S2AQ8A=.;
IF S1Q6A= 1 THEN S1Q6A= .;
IF 2 LE S1Q6A LE 4 THEN GRADE=1;
IF 5 LE S1Q6A LE 7 THEN GRADE=2;
IF 8 LE S1Q6A LE 9 THEN GRADE=3;
IF 10 LE S1Q6A LE 14 THEN GRADE=4;
INTELLIGENCE=GRADE*S1Q6A;
/* Subset data to include only respondents age 18 and older */
IF AGE GE 18;
PROC SORT; by IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=2;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=3;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=4;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=5;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=6;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=7;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=8;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=9;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=10;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=11;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=12;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=13;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=14;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
The table tells me about the probability of an association between grade level completed and alcohol use. However, based on the chi-square value, I don't think I can make any valid conclusions about these variables being correlated.
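The thirteen near-identical comparison blocks above could be collapsed into a loop. A pandas/scipy sketch with made-up stand-in data (real code would load nesarc_pds.csv and loop over all education levels, using a Bonferroni-adjusted threshold for the pairwise tests):

```python
# Pairwise chi-square tests of drinking frequency vs. education level,
# looping instead of repeating one DATA/PROC FREQ block per comparison.
# The DataFrame is a made-up stand-in for NESARC's S2AQ8A and S1Q6A.
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.DataFrame({
    'S1Q6A':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3],
    'S2AQ8A': [1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2],
})

levels = sorted(data['S1Q6A'].unique())
comparisons = [(levels[0], lv) for lv in levels[1:]]
alpha = 0.05 / len(comparisons)   # Bonferroni adjustment

for a, b in comparisons:
    sub = data[data['S1Q6A'].isin([a, b])]
    table = pd.crosstab(sub['S2AQ8A'], sub['S1Q6A'])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.3f}, p = {p:.4f}, "
          f"significant at adjusted alpha: {p < alpha}")
```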
0 notes
Data analysis for grade vs gender and race
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL TAB12MDX="Tobacco Dependence Past 12 Months"
CHECK321="Smoked Cigarettes in Past 12 Months"
S3AQ3B1="Usual Smoking Frequency"
S3AQ3C1="Usual Smoking Quantity"
S1Q6A="Highest Grade or year of school completed"
S2AQ8A="How often drank any alcohol in the last 12 months";
/* Set appropriate missing data as needed */
IF S3AQ3B1=9 THEN S3AQ3B1=.;
IF S3AQ3C1=99 THEN S3AQ3C1=.;
IF S2AQ8A=99 THEN S2AQ8A=.;
IF S1Q6A= 1 THEN S1Q6A= .;
IF 2 LE S1Q6A LE 4 THEN GRADE=1;
IF 5 LE S1Q6A LE 7 THEN GRADE=2;
IF 8 LE S1Q6A LE 9 THEN GRADE=3;
IF 10 LE S1Q6A LE 14 THEN GRADE=4;
INTELLIGENCE=GRADE*S1Q6A;
/* Subset data to include only respondents age 18 and older */
IF AGE GE 18;
PROC SORT; by IDNUM;
PROC ANOVA; CLASS SEX;
MODEL S1Q6A=SEX;
MEANS SEX;
PROC ANOVA; CLASS ETHRACE2A;
MODEL S1Q6A=ETHRACE2A;
MEANS ETHRACE2A/DUNCAN;
RUN;
The ANOVA Procedure

Class Level Information
Class  Levels  Values
SEX    2       1 2

Number of Observations Read: 43093
Number of Observations Used: 42875

Dependent Variable: S1Q6A  Highest Grade or year of school completed

Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model                1        112.9008     112.9008    18.75  <.0001
Error            42873     258168.3027       6.0217
Corrected Total  42874     258281.2035

R-Square  Coeff Var  Root MSE  S1Q6A Mean
0.000437  25.84702   2.453915  9.493994

Source  DF  Anova SS     Mean Square  F Value  Pr > F
SEX     1   112.9008106  112.9008106  18.75    <.0001

Class Level Information
Class      Levels  Values
ETHRACE2A  5       1 2 3 4 5

Number of Observations Read: 43093
Number of Observations Used: 42875

Dependent Variable: S1Q6A  Highest Grade or year of school completed

Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model                4      15981.1633    3995.2908   706.88  <.0001
Error            42870     242300.0402       5.6520
Corrected Total  42874     258281.2035

R-Square  Coeff Var  Root MSE  S1Q6A Mean
0.061875  25.04096   2.377388  9.493994

Source     DF  Anova SS     Mean Square  F Value  Pr > F
ETHRACE2A  4   15981.16328  3995.29082   706.88   <.0001

Duncan's Multiple Range Test for S1Q6A
Note: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.
Alpha 0.05; Error Degrees of Freedom 42870; Error Mean Square 5.651972; Harmonic Mean of Cell Sizes 2007.656
Note: Cell sizes are not equal.
Number of Means:  2      3      4      5
Critical Range:   .1471  .1549  .1601  .1639
Males generally have a higher education level but were also more likely to never attend school or to drop out early.
Hispanics and Latinos have the lowest level of education, while Asians tend to have the highest.
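scipy has no Duncan test, but the post-hoc pairwise comparison in the MEANS ETHRACE2A/DUNCAN step can be sketched with Tukey's HSD, a stricter test that does control the experimentwise error rate; the group samples below are made up:

```python
# Post-hoc pairwise comparison of group means, analogous to the
# MEANS ETHRACE2A/DUNCAN step above. The samples are illustrative
# stand-ins for three ETHRACE2A levels, not the NESARC data.
from scipy.stats import tukey_hsd

group1 = [10, 12, 11, 13, 12]
group2 = [8, 9, 7, 8, 9]
group3 = [12, 14, 13, 15, 14]

result = tukey_hsd(group1, group2, group3)
print(result)   # pairwise mean differences with confidence intervals
```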
0 notes
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Read the dataset
data = pd.read_csv('gapminder.csv', low_memory=False)
# Replace empty strings with NaN
data = data.replace(r'^\s*$', np.nan, regex=True)
# Setting variables you will be working with to numeric
data['internetuserate'] = pd.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pd.to_numeric(data['urbanrate'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')
data['hivrate'] = pd.to_numeric(data['hivrate'], errors='coerce')
# 'country' holds string names; coercing it to numeric would turn every value into NaN, so leave it as a string identifier
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')
# Descriptive statistics
desc1 = data['alcconsumption'].describe()
print(desc1)
desc2 = data['employrate'].describe()
print(desc2)
# Basic scatterplot: Q -> Q
plt.figure(figsize=(10, 6))
sns.regplot(x="alcconsumption", y="hivrate", fit_reg=False, data=data)
plt.xlabel('alcconsumption')
plt.ylabel('hivrate')
plt.title('Scatterplot for the Association Between alcconsumption and HIVRate')
plt.show()
plt.figure(figsize=(10, 6))
sns.regplot(x="lifeexpectancy", y="employrate", data=data)
plt.xlabel('lifeexpectancy')
plt.ylabel('employrate')
plt.title('Scatterplot for the Association Between Life Expectancy and Employment Rate')
plt.show()
# Quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Employrate - 4 categories - quartiles')
data['EMPLOYRATE4'] = pd.qcut(data['employrate'], 4, labels=["1=25th%tile", "2=50%tile", "3=75%tile", "4=100%tile"])
c10 = data['EMPLOYRATE4'].value_counts(sort=False, dropna=True)
print(c10)
# Bivariate bar graph C -> Q
plt.figure(figsize=(10, 6))
sns.catplot(x='EMPLOYRATE4', y='alcconsumption', data=data, kind="bar", ci=None)
plt.xlabel('Employ Rate')
plt.ylabel('Alcohol Consumption Rate')
plt.title('Mean Alcohol Consumption Rate by Employment Rate Quartile')
plt.show()
c11 = data.groupby('EMPLOYRATE4').size()
print(c11)
result = data.sort_values(by=['EMPLOYRATE4'], ascending=True)
print(result)
My data shows little correlation between alcohol consumption and HIV rate. However, when life expectancy is lower, the employment rate is higher, probably because people retire as they get older. Finally, I found that the average alcohol consumption rate goes down from one employment-rate quartile to the next.
0 notes
weight of smokers
import pandas as pd
import numpy as np
# Read the dataset
data = pd.read_csv('nesarc_pds.csv', low_memory=False)
# Bug fix for display formats to avoid runtime errors
pd.set_option('display.float_format', lambda x: '%f' % x)
# Setting variables to numeric
numeric_columns = ['TAB12MDX', 'CHECK321', 'S3AQ3B1', 'S3AQ3C1', 'WEIGHT']
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric)
# Subset data to adults 100 to 600 lbs who have smoked in the past 12 months
sub1 = data[(data['WEIGHT'] >= 100) & (data['WEIGHT'] <= 600) & (data['CHECK321'] == 1)]
# Make a copy of the subsetted data
sub2 = sub1.copy()
def print_value_counts(df, column, description):
    """Print value counts for a specific column in the dataframe."""
    print(f'{description}')
    counts = df[column].value_counts(sort=False, dropna=False)
    print(counts)
    return counts
# Initial counts for S3AQ3B1
print_value_counts(sub2, 'S3AQ3B1', 'Counts for original S3AQ3B1')
# Recode missing values to NaN
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)
# Counts after recoding missing values
print_value_counts(sub2, 'S3AQ3B1', 'Counts for S3AQ3B1 with 9 set to NaN and number of missing requested')
# Recode missing values for S2AQ8A
sub2['S2AQ8A'] = sub2['S2AQ8A'].fillna(11)
sub2['S2AQ8A'] = sub2['S2AQ8A'].replace(99, np.nan)
# Check coding for S2AQ8A
print_value_counts(sub2, 'S2AQ8A', 'S2AQ8A with Blanks recoded as 11 and 99 set to NaN')
print(sub2['S2AQ8A'].describe())
# Recode values for S3AQ3B1 into new variables
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)
# Create secondary variable
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']
# Examine frequency distributions for WEIGHT
print_value_counts(sub2, 'WEIGHT', 'Counts for WEIGHT')
print('Percentages for WEIGHT')
print(sub2['WEIGHT'].value_counts(sort=False, normalize=True))
# Quartile split for WEIGHT
sub2['WEIGHTGROUP4'] = pd.qcut(sub2['WEIGHT'], 4, labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"])
print_value_counts(sub2, 'WEIGHTGROUP4', 'WEIGHT - 4 categories - quartiles')
# Categorize WEIGHT into 3 groups (100-200 lbs, 200-300 lbs, 300-600 lbs)
sub2['WEIGHTGROUP3'] = pd.cut(sub2['WEIGHT'], [100, 200, 300, 600], labels=["100-200 lbs", "201-300 lbs", "301-600 lbs"])
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')
# Crosstab of WEIGHTGROUP3 and WEIGHT
print(pd.crosstab(sub2['WEIGHTGROUP3'], sub2['WEIGHT']))
# Frequency distribution for WEIGHTGROUP3
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')
print('Percentages for WEIGHTGROUP3')
print(sub2['WEIGHTGROUP3'].value_counts(sort=False, normalize=True))
Counts for original S3AQ3B1
S3AQ3B1
1.000000 81
2.000000 6
5.000000 2
4.000000 6
3.000000 3
6.000000 4
Name: count, dtype: int64
Counts for S3AQ3B1 with 9 set to NaN and number of missing requested
S3AQ3B1
1.000000 81
2.000000 6
5.000000 2
4.000000 6
3.000000 3
6.000000 4
Name: count, dtype: int64
S2AQ8A with Blanks recoded as 11 and 99 set to NaN
S2AQ8A
6 12
4 2
7 14
5 16
28
1 6
2 2
10 9
3 5
9 5
8 3
Name: count, dtype: int64
count 102
unique 11
top
freq 28
Name: S2AQ8A, dtype: object
Counts for WEIGHT
WEIGHT
534.703087 1
476.841101 5
534.923423 1
568.208544 1
398.855701 1
..
584.984241 1
577.814060 1
502.267758 1
591.875275 1
483.885024 1
Name: count, Length: 86, dtype: int64
Percentages for WEIGHT
WEIGHT
534.703087 0.009804
476.841101 0.049020
534.923423 0.009804
568.208544 0.009804
398.855701 0.009804
584.984241 0.009804
577.814060 0.009804
502.267758 0.009804
591.875275 0.009804
483.885024 0.009804
Name: proportion, Length: 86, dtype: float64
WEIGHT - 4 categories - quartiles
WEIGHTGROUP4
1=0%tile 26
2=25%tile 25
3=50%tile 25
4=75%tile 26
Name: count, dtype: int64
Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs 0
201-300 lbs 0
301-600 lbs 102
Name: count, dtype: int64
WEIGHT 398.855701 437.144557 … 599.285226 599.720557
WEIGHTGROUP3 …
301-600 lbs 1 1 … 1 1
[1 rows x 86 columns]
Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs 0
201-300 lbs 0
301-600 lbs 102
Name: count, dtype: int64
Percentages for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs 0.000000
201-300 lbs 0.000000
301-600 lbs 1.000000
Name: proportion, dtype: float64
I changed the code to look at the weight of smokers who have smoked in the past year. For WEIGHTGROUP3, all 102 people who smoked in the last year fell into the 301-600 lbs group.
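One detail of the grouping above worth knowing: pd.cut builds right-closed intervals, so the left edge of the first bin (exactly 100) is excluded unless include_lowest=True. A small sketch with made-up weights:

```python
# pd.cut uses right-closed intervals (100, 200], (200, 300], (300, 600],
# so a value of exactly 100 is excluded unless include_lowest=True.
import pandas as pd

weights = pd.Series([100, 150, 250, 450])

default_bins = pd.cut(weights, [100, 200, 300, 600],
                      labels=["100-200", "201-300", "301-600"])
print(default_bins.isna().sum())   # the value 100 falls outside the bins

inclusive = pd.cut(weights, [100, 200, 300, 600],
                   labels=["100-200", "201-300", "301-600"],
                   include_lowest=True)
print(inclusive.isna().sum())      # 100 now lands in the first bin
```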
0 notes
gapminder data review
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import pandas
import numpy
# read file
data = pandas.read_csv('gapminder.csv', low_memory=False)
# set columns to uppercase
data.columns = map(str.upper, data.columns)
# bug fix for display formats
pandas.set_option("display.float_format", lambda x: "%f" %x)
# print number of rows and columns
print(len(data)) # number of observations (rows)
print(len(data.columns)) # number of variables (columns)
# print income per person counts and percentages
print("counts for incomeperperson = income per person")
c1 = data["INCOMEPERPERSON"].value_counts(sort=False)
print (c1)
print("percentages for income per person")
p1 = data["INCOMEPERPERSON"].value_counts(sort=False, normalize=True)
print(p1)
# print HIVRATE counts and percentages
print("counts for HIV RATE")
c1 = data["HIVRATE"].value_counts(sort=False)
print (c1)
print("percentages for HIV RATE")
p1 = data["HIVRATE"].value_counts(sort=False, normalize=True)
print(p1)
# print ALCOHOL CONSUMPTION counts and percentages
print("counts for ALCOHOL CONSUMPTION")
c1 = data["ALCCONSUMPTION"].value_counts(sort=False)
print (c1)
print("percentages for ALCOHOL CONSUMPTION")
p1 = data["ALCCONSUMPTION"].value_counts(sort=False, normalize=True)
print(p1)
213
16
counts for incomeperperson = income per person
INCOMEPERPERSON
23
1914.99655094922 1
2231.99333515006 1
21943.3398976022 1
1381.00426770244 1
..
5528.36311387522 1
722.807558834445 1
610.3573673206 1
432.226336974583 1
320.771889948584 1
Name: count, Length: 191, dtype: int64
percentages for income per person
INCOMEPERPERSON
0.107981
1914.99655094922 0.004695
2231.99333515006 0.004695
21943.3398976022 0.004695
1381.00426770244 0.004695
5528.36311387522 0.004695
722.807558834445 0.004695
610.3573673206 0.004695
432.226336974583 0.004695
320.771889948584 0.004695
Name: proportion, Length: 191, dtype: float64
counts for HIV RATE
HIVRATE
66
.1 28
2 2
.5 5
.3 10
3.1 1
.06 16
1.4 1
.2 15
2.3 1
1.2 4
24.8 1
.45 1
3.3 1
5.3 1
4.7 1
3.4 3
.4 9
2.5 2
.9 4
.8 5
5 1
5.2 1
1.8 1
1.3 2
1.9 1
1.7 1
6.3 1
.7 3
23.6 1
1.5 2
11 1
1 4
11.5 1
.6 3
13.1 1
3.6 1
2.9 1
1.6 1
17.8 1
1.1 2
25.9 1
5.6 1
3.2 1
6.5 1
13.5 1
14.3 1
Name: count, dtype: int64
percentages for HIV RATE
HIVRATE
0.309859
.1 0.131455
2 0.009390
.5 0.023474
.3 0.046948
3.1 0.004695
.06 0.075117
1.4 0.004695
.2 0.070423
2.3 0.004695
1.2 0.018779
24.8 0.004695
.45 0.004695
3.3 0.004695
5.3 0.004695
4.7 0.004695
3.4 0.014085
.4 0.042254
2.5 0.009390
.9 0.018779
.8 0.023474
5 0.004695
5.2 0.004695
1.8 0.004695
1.3 0.009390
1.9 0.004695
1.7 0.004695
6.3 0.004695
.7 0.014085
23.6 0.004695
1.5 0.009390
11 0.004695
1 0.018779
11.5 0.004695
.6 0.014085
13.1 0.004695
3.6 0.004695
2.9 0.004695
1.6 0.004695
17.8 0.004695
1.1 0.009390
25.9 0.004695
5.6 0.004695
3.2 0.004695
6.5 0.004695
13.5 0.004695
14.3 0.004695
Name: proportion, dtype: float64
counts for ALCOHOL CONSUMPTION
ALCCONSUMPTION
.03 1
7.29 1
.69 1
10.17 1
5.57 1
..
7.6 1
3.91 1
.2 1
3.56 1
4.96 1
Name: count, Length: 181, dtype: int64
percentages for ALCOHOL CONSUMPTION
ALCCONSUMPTION
.03 0.004695
7.29 0.004695
.69 0.004695
10.17 0.004695
5.57 0.004695
7.6 0.004695
3.91 0.004695
.2 0.004695
3.56 0.004695
4.96 0.004695
Name: proportion, Length: 181, dtype: float64
Analysis: The table shows that for income per person and alcohol consumption, nearly every category has a unique value, with country being the unique identifier for each row. HIV rate, in contrast, has many repeated low values. I am hoping to correlate all this data together to draw conclusions.
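The unlabeled rows in the counts above are empty strings from the CSV rather than true missing values; a small sketch (illustrative series, not the gapminder data) of converting them so the missing count is explicit:

```python
# Empty strings in a CSV column show up as their own value_counts
# category; converting them to NaN makes the missing count explicit.
import numpy as np
import pandas as pd

col = pd.Series(['1914.9', '', '2231.9', '', '610.3'])
col = col.replace(r'^\s*$', np.nan, regex=True)
col = pd.to_numeric(col, errors='coerce')

print(col.isna().sum())   # number of missing observations
```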
0 notes
Data Analytics Project
The data set I have selected comes from the National Epidemiologic Survey of Drug Use and Health.
My question: is there a correlation between drug use and depression within the dataset? I believe there will be a positive correlation between the two.
In Google Scholar, I searched "drug abuse rates vs major depression" and found an article that makes me think my hypothesis will be correct.
The article states the following: "For example, the crude lifetime prevalences for depression ranged from 0.7 to 8.6 per 100; for alcohol it was 7.5 to 32.6, and for drugs it was 3.1 to 10.5."
This makes me think drugs are not the highest cause of depression but are a contributing factor.
1 note