oorangesoda · 2 months ago
Text
I know love is stored in the OC because what do you mean I'm helplessly attached to a guy I created and drew for the first time only four months ago
5 notes · View notes
because-its-important · 7 years ago
Text
what's the most annoying question to ask a nun* in 1967?
tl;dr - In 1967, a very long survey was administered to nearly 140,000 American women in Catholic ministry. I wrote this script, which makes the survey data work-ready and satisfies a very silly initial inquiry: Which survey question did the sisters find most annoying?
* The study participants are never referred to as nuns, so I kind of suspect that not all sisters are nuns, but I couldn't find a definitive answer about this during a brief search. 'Nun' seemed like an efficient shorthand for purposes of an already long title, but if this is wrong please holler at me!
During my first week at Recurse I made a quick game using a new language and a new toolset. Making a game on my own had been a long-running item on my list of arbitrary-but-personally-meaningful goals, so being able to cross it off felt pretty good!
Another such goal I've had for a while goes something like this: "Develop the skills to be able to find a compelling data set, ask some questions, and share the results." As such, I spent last week familiarizing myself with Python 🐍, selecting a fun dataset, prepping it for analysis, and indulging my curiosity.
the process
On recommendation from Robert Schuessler, another Recurser in my batch, I read through the first ten chapters in Python Crash Course and did the data analysis project. This section takes you through comparing time series data using weather reports for two different locations, then through plotting country populations on a world map.
During data analysis study group, Robert suggested that we find a few datasets and write scripts to get them ready to work with as a sample starter-pack for the group. Jeremy Singer-Vine's collection of esoteric datasets, Data Is Plural, came to mind immediately. I was super excited to finally have an excuse to pore through it and eagerly set about picking a real mixed bag of six different data sets.
One of those datasets was The Sister Survey, a huge, one-of-a-kind collection of data on the opinions of American Catholic sisters about religious life. When I read the first question, I was hooked.
"It seems to me that all our concepts of God and His activity are to some degree historically and culturally conditioned, and therefore we must always be open to new ways of approaching Him."
I decided I wanted to start with this survey and spend enough time with it to answer at least one easy question. A quick skim of the Questions and Responses file showed that of the multiple choice answer options, a recurring one was: "The statement is so annoying to me that I cannot answer."
I thought this was a pretty funny option, especially given that participants were already tolerant enough to take such an enormous survey! How many questions can one answer before any question is too annoying to answer? 🤔 I decided it'd be fairly simple to find the most annoying question, so I started there.
I discovered pretty quickly that while the survey responses are in a large yet blessedly simple csv, the file with the question and answers key is just a big ole plain text. My solution was to regex through every line in the txt file and build out a survey_key dict that holds the question text and another dict of the set of possible answers for each question. This works pretty well, though I've spotted at least one instance where the txt file is inconsistently formatted and therefore breaks answer retrieval.
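Here's roughly the shape of that regex pass, sketched against a made-up key format — the real txt file's layout (and its inconsistent spots) will differ, so treat the patterns as placeholders:

```python
import re

def build_survey_key(lines):
    """Build {question_id: {"text": ..., "answers": {code: answer_text}}}
    from a plain-text key. The line formats below are hypothetical."""
    survey_key = {}
    current = None
    q_pat = re.compile(r"^Q(\d+)[.:]\s*(.+)")       # e.g. "Q1. It seems to me..."
    a_pat = re.compile(r"^\s+(\d+)\s*[-=]\s*(.+)")  # e.g. "  5 - Agree"
    for line in lines:
        if m := q_pat.match(line):
            current = m.group(1)
            survey_key[current] = {"text": m.group(2).strip(), "answers": {}}
        elif current and (m := a_pat.match(line)):
            survey_key[current]["answers"][m.group(1)] = m.group(2).strip()
    return survey_key

key = build_survey_key([
    "Q1. It seems to me that all our concepts of God...",
    "  1 - Agree",
    "  5 - The statement is so annoying to me that I cannot answer.",
])
```

Any line that matches neither pattern is simply skipped, which is exactly how a formatting inconsistency ends up silently breaking answer retrieval for one question.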
Next, I ran over each question in the survey, counted how many responses include the phrase "so annoying" and selected the question with the highest count of matching responses.
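That counting step can be sketched in a few lines of pandas. Everything below is made up for illustration — in the real script the "so annoying" option is matched through the answer text from the key, not a shared numeric code:

```python
import pandas as pd

# Hypothetical mini-frame: one row per sister, one column per question,
# cells holding answer codes. Pretend code 5 is the "so annoying" option
# for every question.
responses = pd.DataFrame({
    "Q1": [1, 5, 5, 2],
    "Q2": [5, 5, 5, 1],
})
ANNOYED = 5

annoyed_counts = (responses == ANNOYED).sum()  # per-question tally of annoyed answers
most_annoying = annoyed_counts.idxmax()        # question with the highest tally
print(most_annoying, annoyed_counts[most_annoying])  # Q2 3
```

The boolean comparison vectorizes the whole count, so there's no per-row loop to slow things down.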
the most annoying question
Turns out it's this one! The survey asks participants to indicate whether they agree or disagree with the following statement:
"Christian virginity goes all the way along a road on which marriage stops half way."
3702 sisters (3%) responded that they found the statement too annoying to answer. The most popular answer was No at 56% of respondents.
I'm not really sure how to interpret this question! So far I have two running theories about the responses:
The survey participants were also confused and boy, being confused is annoying!
The sisters generally weren't down for claiming superiority over other women on the basis of their marital-sexual status.
Both of these interpretations align suspiciously well with my own opinions on the matter, though, so, ymmv.
9x speed improvement in one lil refactor
The first time I ran a working version of the full script it took around 27 minutes.
I didn't (still don't) have the experience to know if this is fast or slow for the size of the dataset, but I did figure that it was worth making at least one attempt to speed up. Half an hour is a long time to wait for a punchline!
As you can see in this commit, I originally had a function called unify that rewrote the answers in the survey from the floats they'd initially been stored as to the plain-text answers returned from the survey_key. I figured that it made sense to build a dataframe with the complete info, then perform my queries against that dataframe alone.
However, the script was spending over 80% of its time in this function, which I knew from aggressively outputting the script's progress and timing it. I also knew that I didn't strictly need to be doing any answer rewriting at all. So, I spent a little while refactoring find_the_most_annoying_question to use a new function, get_answer_text, which returns the descriptive answer text when passed the answer key and its question. This shaved 9 lines (roughly 12%) off my entire script.
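In sketch form, the difference between the two approaches looks something like this — the function names match the ones above, but the bodies are my reconstruction, not the actual commit:

```python
import pandas as pd

# Slow path (sketch): rewrite every cell from its float code to the
# descriptive text up front -- one string write per cell in the frame.
def unify(df, survey_key):
    for q in df.columns:
        df[q] = df[q].map(lambda code: survey_key[q]["answers"][str(int(code))])
    return df

# Fast path (sketch): leave the codes alone and translate a single
# answer only at the moment a human-readable label is needed.
def get_answer_text(survey_key, question, code):
    return survey_key[question]["answers"][str(int(code))]
```

The slow path does a dictionary lookup and a write for every cell in the dataframe whether or not that cell is ever queried; the fast path does a handful of lookups total.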
Upon running the script post-refactor, I knew right away that this approach was much, much faster - but I still wasn't prepared when it finished after only 3 minutes! And since I knew between one and two of those minutes were spent downloading the initial csv alone, that meant I'd effectively neutralized the most egregious time hog in the script. 👍
I still don't know exactly why this is so much more efficient. The best explanation I have right now is "welp, writing data must be much more expensive than comparing it!" Perhaps this Nand2Tetris course I'll be starting this week will help me better articulate these sorts of things.
flourishes 💚💛💜
Working on a script that takes forever to run foments at least two desires:
to know what the script is doing Right Now
to spruce the place up a bit
I added an otherwise unnecessary index while running over all the questions in the survey so that I could use it to cycle through a small set of characters. Last week I wrote in my mini-RC blog, "Find out wtf modulo is good for." Well, well, well.
Here's what my script looks like when it's iterating over each question in the survey:
[screenshot of the script's animated progress output]
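The modulo trick itself is tiny: taking the running index mod the number of frames wraps it back to zero, so a short string of characters cycles forever. A minimal sketch (the frames and messages here are stand-ins, not my actual script):

```python
import sys
import time

frames = "|/-\\"  # any short cycle of characters works; emoji do too
questions = [f"question {n}" for n in range(1, 9)]  # stand-in work items

for i, question in enumerate(questions):
    # i % len(frames) wraps the index back to 0, so the frames repeat
    spinner = frames[i % len(frames)]
    sys.stdout.write(f"\r{spinner} tallying responses for {question}")
    sys.stdout.flush()
    time.sleep(0.05)  # simulate a bit of work per question
print()
```

The carriage return `\r` rewrites the same terminal line each iteration, which is what makes it read as animation instead of a scrolling log.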
I justified my vanity with the (true!) fact that it is easier to work in a friendly-feeling environment.
Plus, this was a good excuse to play with constructing emojis dynamically. I thought I'd find a rainbow of hearts with sequential unicode ids, but it turns out that ❤️ 💙 and 🖤 all have very different values. ¯\_(ツ)_/¯
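You can see this from the REPL: the classic red heart predates emoji and lives in the old Dingbats block, while the colored hearts sit way out in Miscellaneous Symbols and Pictographs. The blue-through-purple run is consecutive, though, so that stretch of the rainbow can be generated from a starting code point:

```python
# Three hearts, three very different code points:
print(hex(ord("❤")))   # 0x2764
print(hex(ord("💙")))  # 0x1f499
print(hex(ord("🖤")))  # 0x1f5a4

# But 💙💚💛💜 are consecutive, so they can be built dynamically:
hearts = "".join(chr(0x1F499 + i) for i in range(4))
print(hearts)  # 💙💚💛💜
```

(The ❤️ in running text also carries U+FE0F, a variation selector that requests the colorful emoji rendering rather than the plain text glyph.)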
the data set
One of the central joys of working with this dataset has been having cause to learn some history that I'd otherwise never be exposed to. Here's a rundown of some interesting things I learned:
This dataset was only made accessible in October of this year. The effort to digitize and publicly release The Sister Survey was spearheaded by Helen Hockx-Yu, Notre Dame's Program Manager for Digital Product Access and Dissemination, and Charles Lamb, a senior archivist at Notre Dame. After attending one of her forums on digital preservation, Lamb approached Hockx-Yu with a dataset he thought "would generate enormous scholarly interest but was not publicly accessible."
Previously, the data had been stored on "21 magnetic tapes dating from 1966 to 1990" (Ibid) and an enormous amount of work went into making it usable. This involved both transferring the raw data from the tapes and deciphering it once it'd been translated into a digital form.
The timing of the original survey in 1967 was not arbitrary: it was a response to the Second Vatican Council (Vatican II). Vatican II was a Big Deal! Half a century later, it remains the most recent Catholic council of its magnitude. For example, before Vatican II, mass was delivered in Latin by a priest who faced away from his congregation and Catholics were forbidden from attending Protestant services or reading from a Protestant Bible. Vatican II decreed that mass should be more participatory and conducted in the vernacular, that women should be allowed into roles as "readers, lectors, and Eucharistic ministers," and that the Jewish people should be considered as "brothers and sisters under the same God" (Ibid).
The survey's author, Marie Augusta Neal, SND, dedicated her life of scholarship towards studying the "sources of values and attitudes towards change" (Ibid) among religious figures. A primary criticism of the survey was that Neal's questions were leading, and in particular, leading respondents towards greater political activation. ✊
As someone with next to zero conception of religious history, working with this dataset was a way to expand my knowledge in a few directions all at once. Pretty pumped to keep developing my working-with-data skills.
2 notes · View notes