wordsaficionado · 1 month
Text
I think a great piece of advice for people who want to learn, but don’t know what to learn, is to take advantage of Wikipedia.
However, the second half of this advice is to use that specific feature where some words are highlighted as links, and fall down rabbit holes. The best part about this is you don’t have to have anything super specific to search.
Say, for example, I’m in the mood for some ELA-type stuff, but I don’t know what kinds of stuff there are. I simply go on Wikipedia and search English. Admittedly, that doesn’t bring up much, but there IS a blue link to a page about the English language.
From there, I can click on things like “Indo-European language family” and “early Medieval England” and “British Empire” if I’m in the mood for more history things. Or, if I really want to learn about words, I can click the vocabulary section, or the phonology section, or the orthography section, or even the grammar section.
There’s so much to be learned in this one broad category, and it’s the same for every other subject you learn in school, which is a great place to start.
Math (mathematics)? Empirical sciences, number theory, and set theory are just in the opening paragraph. Not to mention the sections titled “Relationship with astrology and esotericism” and “Symbolic notation.”
History? Well, from history we can get to History of Earth, and it’s not hard to guess how much is there. There’s also anything that could be a “History of”: History of Mankind, History of Dinosaurs, History of Philosophy. Literally pretty much anything you could want to learn about is on Wikipedia.
The point is, Wikipedia is an amazing tool and source of knowledge. This strategy is a great way to actually access that knowledge. Have fun!
thebrilliot · 4 months
Text
Cube Turns
After figuring out the group of the rotations of a cube, the next step is to combine them with turns on a Rubik's cube. From now on, when I refer to rotations, I'm most likely talking about the orientations of cubelets (there will be exceptions), and when I refer to turns, I mean turning one of the faces of the cube. I'm also going to coin the term "move," but the difference between a turn and a move will take a little more explaining.
Again, I did something for my own selfish reason of wanting to make the coding simpler. I've been rethinking this choice, but I came up with the concept of a "move" to avoid my own annoyance. There are 18 commonly used actions you can take on a Rubik's cube: R for a clockwise turn on the right face, U (up), F (front), L (left), B (back), and D (down), plus the counterclockwise (e.g. R') and half-turn (e.g. R2) variants of each. Well, if you want to efficiently search a space, you want to do your best not to revisit any state pointlessly, and sure, you can turn the right face and then make sure the next turn in the sequence is not the right face, but what's to keep you from turning the right face, then the left, then the right again? That would be pointless. The left and right faces don't interact with each other. You could alternate between the two and you would only ever visit 16 distinct states. So I have been focusing on "moves" instead of turns. A move gives the number of quarter-turns of both faces orthogonal to an axis of rotation; e.g., X12 is equal to L R2. (Clockwise and counterclockwise are explained in another blog post.) The moves are X01, X02, ..., X10, X11, ..., X21, ..., X33, Y01, and so on. There are 45 moves in all as opposed to 18 turns.
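As a quick illustration, here's how the 45 moves can be enumerated -- a Python sketch of my own, not the project's actual code, and the Y/Z face pairings are guesses, since only X = L/R is pinned down by X12 = L R2:

```python
from itertools import product

# Each axis pairs two opposite faces; X12 = L R2 pins down the X axis.
# The Y and Z pairings below are assumptions.
AXES = {"X": ("L", "R"), "Y": ("D", "U"), "Z": ("B", "F")}

def all_moves():
    moves = []
    for axis in AXES:
        for a, b in product(range(4), repeat=2):
            if (a, b) == (0, 0):
                continue  # doing nothing is not a move
            moves.append(f"{axis}{a}{b}")  # a, b = quarter-turns of each face
    return moves

moves = all_moves()
assert len(moves) == 45  # 15 per axis x 3 axes
print(moves[:5])  # ['X01', 'X02', 'X03', 'X10', 'X11']
```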
I've been using moves to explore the space since the beginning. Let me know if you like or hate the idea. I will probably do this all over with turns instead of moves once I'm satisfied with how things go. The more I talk about this, the more I realize how stupid it is, but I'm going to lie in this bed I've made.
Personal tangent aside, exploring depth-first is extremely easy this way. Just loop through all 45 moves, and then all 30 moves on a different axis from the previous move. Accordingly, so far each level of depth has about 29-30 times the number of states as the previous level. The number of states increases extremely quickly, which is why I haven't gotten past depth 5.
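In sketch form (again hypothetical Python, with `apply_move` standing in for the real cube update):

```python
def explore(apply_move, state, moves_by_axis, depth, prev_axis=None):
    """Depth-first enumeration that never follows a move with another
    move on the same axis, per the pruning described above."""
    if depth == 0:
        return
    for axis, moves in moves_by_axis.items():
        if axis == prev_axis:
            continue  # 45 branches at the root, 30 at every later level
        for move in moves:
            explore(apply_move, apply_move(state, move),
                    moves_by_axis, depth - 1, prev_axis=axis)
```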
I do have a plan to increase the storage efficiency using conjugacy sets (I remembered the name!), for which the move concept should be useful. If you are an avid cuber, this should sound similar to a conjugate. A conjugate is an algorithm that is bookended by a sequence of turns and its inverse. For example, L U F U’ F’ L’ is a conjugate because it starts with L and ends with its inverse L’. This is actually the exact same definition as a conjugate in group theory. I'm using it to mean something slightly different, which just goes to show again that I'm not a mathematician. I am using conjugate to mean "rotate the whole cube, then make the turns, then rotate the cube back". The conjugacy set would be the set of all move sequences (and therefore cube states) that can be achieved by rotating the cube before, making the moves, and rotating it back after, plus their inverses. In group theory, the inverse is not always a part of the conjugacy class, and I don't think it is in this case either, but I know there's a way to do it so I'll do it. The "move" concept I have created lends itself quite well to building a conjugacy set like this, because instead of actually rotating the cube, performing the moves, and rotating the cube back, I've found a convenient rule for rotating the move sequence itself instead.
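As a toy illustration of "rotating the move sequence itself": under a whole-cube rotation, a turn sequence can be conjugated just by relabeling faces. The mapping below assumes a 90° rotation about the L-R axis (the cuber's "x" rotation); the exact labels are my guess, not the project's convention:

```python
# Whole-cube "x" rotation: F -> U, U -> B, B -> D, D -> F; L and R are fixed.
X_ROTATION = {"U": "B", "B": "D", "D": "F", "F": "U", "R": "R", "L": "L"}

def rotate_sequence(turns):
    """Relabel the face letter of each turn; the ' and 2 suffixes carry
    over unchanged, since rotations preserve clockwise-ness."""
    return [X_ROTATION[t[0]] + t[1:] for t in turns]

print(rotate_sequence(["L", "U", "F", "U'", "F'", "L'"]))
# ['L', 'B', 'U', "B'", "U'", "L'"]
```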
I do hope to make this publicly available when it's far enough along. For now, I keep the code in a private repository. Eventually though, I will add the normal turn notation so that people can use what's familiar. I'd also love to make animations if possible. I'm a huge admirer of 3Blue1Brown (AKA Grant Sanderson) and other math and code YouTube educators. I am so new to that kind of work that I have no idea how to go about using the tools. If anyone knows, I'm game to try learning.
inclusion-toolkit · 1 year
Text
Module 5: Mathematics
Reflection:
It surprises me how many adults tell me that Math was their least favourite subject in school. I am not a natural mathematician, but I was never intimidated by it. My father was my first Math teacher. I remember how he would tell me stories of his day at work and embed computational questions that I would willingly answer, not knowing I was actually being taught simple arithmetic. I think what his storytelling did for me was humanize mathematics. It was not a detached, foreign concept, but a part of everyday life. In hindsight, I also used my father’s pedagogical style when I was a classroom teacher. I taught English Language Learners, and I often incorporated new vocabulary into stories so that they could relate to the words on a more personal level. Nowadays, teachers have access to so many resources that they do not have to think of stories on the fly. I remember seeing a classroom teacher read a book titled Two of Everything by Lily Toy Hong. This book can easily be an introduction to multiplication, and there are many more books like it.
[Image: cover of Two of Everything by Lily Toy Hong]
Source: Goodreads
Focus Questions:
What math skills do students continue to struggle with?
From what my students have shared with me, long division seems to be a persistent sticking point. My theory is that it is confusing because the way it is notated is unfamiliar to the students. It is the only arithmetic process that uses a tableau to notate the division instead of / or ÷. Looking back, I would not be able to explain why long division works. I just know that it does.
What supports and strategies could you use to support students with lagging math skills?
Aside from one-on-one in-class support, we are fortunate in our school to offer a Learning Buddies Network Math tutoring program every Thursday after school. My students who are registered in this program have seen a dramatic improvement in their Math scores as well as their interest in the subject. Other ways to support students include practicing with manipulatives, using mnemonics, and presenting Math problems in multimodal ways such as pictures, diagrams, and charts instead of equations.
How does literacy affect math learning?
Some areas of Math, e.g. word problems, are highly dependent on literacy skills due to the nature of the task. Math textbooks also use a lot of unfamiliar terms like “subtract” and “carry.”
What support do you have for struggling readers?
Practice with manipulatives can build important prior knowledge about a Math concept. The more students know about the concept, the better they can comprehend and solve a problem related to it. Teachers can also represent problems multimodally instead of in a written manner, such as with online math games, which can capitalize on a student’s visual learning strengths.
How do you deal with students who read well, but struggle with literacy in math problems?
As I have mentioned before, Math uses a whole different set of terms that students may be unfamiliar with. Thus, teachers should select vocabulary carefully, simplify sentences, repeat key words often, and perhaps explicitly teach Math terminology. Another way to help students who may be overwhelmed by a complex word problem is to separate the problem into its subparts and work through each one. This can help students focus, see connections, and avoid memory overload.
What are some different ways to assess math learning?
Aside from traditional quizzes and worksheets, I have observed classroom teachers using online games to keep track of student progress. Some teachers also have Math journals and KWL charts. Of course, nothing takes the place of observation and conversation to see how students interact and understand the concepts.
How do you record learning while it’s happening?
I have seen teachers use post-its with a couple of checklists that they tick off while observing and interacting with the students during the lesson. During my teacher education program, we were also encouraged to use exit tickets as assessment. For instance, each student is given a post-it on which to create their own subtraction equation.
Is all your math assessment summative?
Not at all. The online games the students play, the Math journals and KWL charts, the teacher observation checklists, and the exit tickets are all types of formative assessments.
How can you add more assessment for learning and as learning to your pedagogy?
Teacher interaction and observation is a valuable assessment format. Having conversations with students, whether informal or formal like an interview, is an excellent way to assess students' understanding of a concept.
Two or three important ideas that you took away from the module
Literacy skills can be an important precursor to mathematical performance.
I learned a lot about the Provincial Graduation Numeracy Assessment (GNA). I found out that students have the opportunity to retake the GNA three times between Grade 10 and Grade 12 to improve their proficiency score: (1) Emerging, (2) Developing, (3) Proficient, (4) Extending. Only their best score will be counted as their final result.
At least one useful resource
DreamBox Learning's Math Program, an online Intelligent Adaptive Learning™ platform that uses advanced artificial intelligence to personalize learning with a supplemental curriculum.
At least one useful strategy
Math Daily 3 from The Daily CAFE (I have not seen a classroom teacher use this. The Daily 5 is quite common, though).
INA Assessments by Island Numeracy Network (I have heard of this, but did not know much about it until now).
averlym · 3 years
Note
howard c4 or boleyn b4?
[two images]
yes
fencesandfrogs · 4 years
Text
a follow-up to this post where i talk about math and me as a kid.
Wait you have dyscalculia but are a math major? Wow I have dyscalculia but I like the philosophy of math I guess like I sorta forced myself to get into it to learn, I feel I can do basic so for me it's mainly the math anxiety
@totallysweetheart​
tl;dr: the part of my brain that deals with abstract/tangible is, i think, broken, because i can’t deal with numbers as real things, but i can do that with polynomials or w/e.
so to summarize, based on wikipedia’s list of dyscalculia symptoms, here is me:
analog clocks: i’m fine to 15min in real life where i know the time of day, but in a vacuum, most real clocks r tricky. doesn’t come up much. the teaching clocks i’m usually fine with because the minute and hour hands are really distinct.
larger numbers: depends on presentation. purely verbally? no. visually? depends. if they both start with the same number it’s harder.
sequencing issues: not really.
financial planning: bank accounts are black magic and my mom still manages mine. i err on the side of frugal, which left me with like 50% of my college meal plan unspent last semester.
visualizing numbers: no. nope. can’t do. not at all. numbers r fake. 
arithmetic: it sucks, a lot. i’m better at multiplying and adding, and it’s gotten better because i did a lot of practice a few years ago, but i still prefer calculators. 
number writing difficulties: yeah? hard to say. i’ve been doing algebraic stuff for a long time, and that really cuts down on the number of places to make those kinds of mistakes.
concepts and practice: this is where i’m strongest. my math conceptual game is strong as hell, and i don’t usually struggle with putting it into practice. even word problems i’m pretty strong at because like. it’s just math.
names of numbers: not really an issue.
left/right: also not really an issue. although it takes me a second.
spatial awareness: doesn’t exist. just. doesn’t. people don’t believe me then they ask me how long something is and i say like three feet and they’re like “it’s taller than you” and i’m like “oh really? huh the more you know”
time: im timeblind af. also adhd tho so that doesn’t help.
maps: ehhhhh. hard to say. I’m okay with some parts of maps but not others. this has definitely improved since school.
working backwards in time: i have an app for that its beautiful and i love it
music: i am good at music notation. not great at rhythm but i’m good at music in general.
dance: i did 12 years of dance. i’m not amazing, but it was a nonissue.
estimation: see: time, spatial awareness (the answer is i cannot)
remembering formulas, etc: i’m usually good at remembering this stuff.
concentration: adhd already so? maybe?
faces, names: i do not do very well here.
so like. i basically have the best possible set of symptoms to become a math major. i kind of skirted attention as a kid because i could get around a lot of my difficulties and didn’t really have anything to do but use brute force to cram multiplication facts into my head.
and because i had this really strong conceptual understanding, i just sort of survived until algebra. at which point i was very happy.
because basically most of my dyscalculia issues revolve around numbers and the real world. i can’t do time, i can’t estimate, i can’t really work with numbers. but i can work with algebra because the concepts were fine. there was just a road block.
for me, it’s kind of like having a major speech disorder in your native language. the speech in your mouth doesn’t work, not the language itself. as a kid i loved writing because the words came out the way they were in my head. they didn’t get shuffled and mangled. and that’s also how i felt about algebra. like, look! you don’t have to worry about getting the numbers right if you can move the variables around.
and obviously it’s not that complicated because i’m skipping basically from fifth grade to my junior year of high school, but even though it was a constant friction between me and everyone about why i kept making careless mistakes, even after other adhd stuff got treatment, it was generally acknowledged that i knew what i was doing, so i never really developed math anxiety. 
and as a math major, like, numbers are not a very large part of what you do. i use wolfram alpha a lot for solving that sort of thing. i do stuff that’s more about the logic parts of math. lil puzzles waiting to be solved.
it really does feel kind of like the abstract and tangible parts of my brain were swapped. because numbers really do feel abstract, but figuring out the equations of a graph is a fun game to play with friends. i usually get the constants wrong, but that’s beside the point.
i’m not entirely sure if this was helpful and/or clarifying in any way. if asked, i will usually not mention dyscalculia because? it just doesn’t feel very relevant/serious. because my management strategy is: don’t do anything with numbers and estimation ever. and then that works, because i don’t have to. it’s only really relevant in the context of me, a child, very confused about why those centimeter cubes exist, etc. 
and also, as i got older, i dug more and more into theory and proofs. learning about numbers as entities that follow rules was a really useful thing for me. learning about negative numbers made subtraction easier for me because it wasn’t addition in reverse, it was addition of a negative number. which made more sense to me.
i struggled in high school geometry because of all of the numbers and angles (i have a shirt somewhere that says “all i learned in geometry is that you can’t measure shapes”) and every time someone pointed out applications to me i kind of just went “okay but there are rulers for that”
and i do like geometry! i like how we can build properties out of simple rules and how shapes behave, and it’s really cool that you only need like 5 postulates to build a lot of geometry. but if you make me deal with too many angles i want to cry
so yeah. uh. i’m a math major & it works because when we deal with numbers, they’re almost variables in themselves? like okay we’re going to use 0 and 1 here to apply this theorem but the numbers themselves aren’t relevant.
[Image: screenshot from my calc textbook]
here is a screenshot from my calc textbook, if this helps make my point. most of these concepts are things i can just. put in my head and hold the way people who can think about numbers describe numbers to me. 
i have no idea where u are in ur life but if u like math from the logic side, then pure math exists and its p cool. usually you gotta get thru calculus, and then take a course in proof writing (at my uni it’s called “transition to advanced math”) at which point everything turns into theorems and proofs and the most number intensive course is probability. i don’t even need statistics credits to graduate.
this was a lot and i tried to wrap it up like 3 times and then i had more to say because i think a lot abt math and the fact that i was lucky to have the right opportunities to not entirely chase myself away from the field (which is a lot more words and i should probably work on my hw) but if u have more questions lemme know bc! i am very dedicated to exposing people to math and why i love it.
studyradius · 6 years
Text
I wrote this for my personal diary, but I believe it’s also appropriate here.
As I have finished a textbook on the beginnings of naive set theory... let's summarize, and embrace the bigger picture.
1. School math very much $\ne$ all other (normal? decent?) math, and I love that, though I am also terrified (pleasantly so);
  1.1 which may also have to do with what someone said on MSE, something in the fashion of 'set theory is highly abstract, more so than some other areas of mathematics';
2. commutativity and associativity can be entrusted no longer;
3. all in all, MSE and math.hashcode.ru are nice and valuable resources (though, as any other streams of constant new material, they can tap into the neural pathways in a destructive, wearing-out-the-reward-circuit manner; also, I just don't understand a proof written in a foreign language sometimes);
4. I feel blessed to have been given access to the very foundations of the queen of all sciences... something I wanted to do a long time ago... peeking under math's skirt;
5. unfortunately, I lack the persistence/tenacity/intelligence to understand some concepts/proofs and to do half of the problems;
6. using MathJax and LaTeX feels very rewarding; I think I could work as a typesetter and never complain, haha;
7. I feel like I could write something technical (did you know R can be well-ordered? how do you feel about Zorn's lemma? P(n)=>P(lim ord), anyone?); it feels wrong at the same time, as if that would mean I'm showing off--math is not about learning half-understood theorems by heart, it's about solving problems in a logically correct, rigorous, elegant manner (which to me was hard; I don't do enough problems!).
All in all, let's //choose// the following summarizing one-liner: the attempt to learn elementary set theory INCREASED my MATHEMATICAL MATURITY, while at the same time DECREASING my INTELLECTUAL abilities/problem-solving skills/whatever you call that.
I don't know how much sense that makes. Are these mutually exclusive? Am I deluding myself? Have you ever felt, after a long time of [procrastination/doing nothing/not burdening your mind with intellectual work], strictly physically, as if the connections in your brain are starting to disconnect and gradually fade away, leaving you stupider/dumber than you were? That's how I feel right now. I am also grateful I was exposed to proofs that use 80% human words and 20% mathematical notation--a complete reverse of what they do in school. I don't know why I am still keeping a diary on a dead platform, where nobody is reading this. Perhaps. Okay, bye!
mathematicianadda · 5 years
Text
The importance of pretty pictures
I think math has a lot of aesthetic beauty. There's certainly beauty available to those who have a strong understanding of it. Mathematicians call proofs and theorems "beautiful" even when they're nothing more than words on paper.
But there's also a simpler, more immediate form of beauty that comes from mathematics. I'm talking about the pretty pictures you find in textbooks and off in the margins of articles. Things like this and this and this one I made myself to illustrate how SL2(Z) acts on the upper half-plane. Anybody can appreciate these, regardless of whether they know the math behind them or even how to read and write. Anybody with a soul would call these beautiful. And I think these serve a very fundamental role in mathematics.
There are a lot of other problems with the way we teach math in schools, but one of the biggest is that we suck all the beauty out of it. In A Mathematician's Lament, Paul Lockhart compared it to having to study staff notation for ten years before you're allowed to listen to Bach. Math pedagogy is so worried about teaching details and notation that we can often forget we're studying things that have roots in the physical world around us, which can and should be easily visualized. When you can learn about immortal jellyfish and mushrooms the size of Texas in intro biology classes, play with Van de Graaff generators and double-slit experiments in AP Physics, and watch sodium react explosively with water in grade school chemistry classes, why don't you get to draw complete graphs and stare at Julia sets before you've mastered tons of notation? Why does math have to be the only not-fun natural science taught in grade school?
The reason I became interested in analytic number theory when I was in high school is because of something that happened when I was really little. I went to a math club thing with my dad at a nearby university, and afterward I checked out the library and curiously picked up a random book that I think was about modular forms. There was a picture like this and it was so fascinating and amazing to me that I just had to learn more. How was it drawn, what does it mean, and what does it have to do with number theory? Everybody who seriously studies math has a similar story, of a beautiful illustration they saw or a cool trick they learned that made them curious. Pretty pictures are probably, in some way or another, the reason most of us are on this subreddit right now.
thebrilliot · 4 months
Text
Rubik's cube solving agent: Road Map
K, haven't done stuff in a little bit; I wasn't sure what to do first. BUT I did do a lot of studying of group theory, and I really appreciate when someone else has already established a vocabulary for a system of thinking. It does apply very well to Rubik's cubes.
So I'm thinking that there is enough I want to change that I will be better off refactoring right now. The code base isn't even large at this point, but I do know some of the things that I want to accomplish with this project, and I want to make quick progress on this. I have other projects I want to get to too! Making progress on this problem I first started on 12 years ago will feel sooooo good. Gotta keep moving forward. More iteration = more speed, even if that requires rewriting code more often.
That means I'm going to make a new branch for v0.2! Once I have v0.2 where I want it, I'll make the repo public. I think I will be comfortable with that. I have been reading about words and normal form and generating sets and Cayley diagrams and things. If you watch Matthew Macauley's Visual Group Theory lectures on YouTube, he talks about a "Big Book" that contains the shortest solution for every Rubik's cube and I'm borrowing that idea for what I previously called a "Store". Here is the rough road map for the project now:
v0.1 (there is a cube and you can turn it)
completed lol
v0.2 (flexible and (mostly) optimized book generation)
better and alternative notation, position vs. ID relative cube representation, enable flexible book creation via CLI, maximize book size using embedded DB, cube and word normal and compressed serialization
v0.3 (usable for anyone)
Python interface using pyo3, TUI improvement, colored facelet representation, strategic-game-cube dataset compatibility, word reduction and substitution
v0.4 (suitable for training Rubik's cube agents)
books as training datasets, 4 trainable tasks - (valid cube identification, masked cubelet prediction, depth prediction, cube solving RL)
v0.5 (nice-to-haves, idk)
algorithm exploration in TUI?, conjugacy classes (at least I'll give it a try), 3D viewer???, I do think it would be cool to watch the cube as the agent tries to solve the cube
I swear this will all make more sense when you can see the repo! I wish I had stuff to show, you know? Unfortunately, that's not how studying works, and my notes and thoughts are a mess. Hey, let me know if you would want to watch me work! The fun of this for me is that there is literally no point to it, lol. I'm making this for me. It will bother me if I don't finish yet another project, and leaving the problem itself unsolved is already bothering me. If you want to join me in the fun, I will gladly show you my mad ravings and rant about how this entire project is just to enable a single experiment: to see if, when you're training a model to navigate a space that can be represented by permutations, the permutations will show up in the embeddings.
I don't stream much but I have a Twitch account! What do you think?
careergrowthblog · 5 years
Text
Your curriculum defines your school. Own it. Shape it. Celebrate it.
At the Heads’ Roundtable event this week I was making a pitch for school leaders to get stuck into a deep curriculum review process – as many already have.  Not because of the expectations of whatever accountability process is underway, but because it matters so much.   To a degree that is underplayed all too often, I would suggest that schools are fundamentally defined by their curriculum.
Every school has its motto – those value statements emblazoned on every letterhead, every blazer, above the entrance… Respect. Courage. Resilience. Ambition. Compassion. Fortiter Ex Animo. Carpe Diem. But these grand ideas only take form in the context of students doing things, learning things, experiencing things, receiving messages about things… actual things that you have decided on. And those things are your curriculum; the actual tangible real-life curriculum that is enacted across the days, weeks and years of a life in your school.
As I have explored in some detail in this post, 10 Steps for Reviewing Your KS3 Curriculum, there is a process that applies to any curriculum review, starting with getting to know your school curriculum as it is. I’m suggesting that school leaders make sure they have developed a strong set of principles around curriculum design and content, informed by exploring their own curriculum and a range of alternatives. What do you believe about what your children should learn across the curriculum? Often leaders are only relatively expert in a narrow set of curriculum ideas – we’ve all largely been trained in specific knowledge domains, so it’s difficult to know what the possibilities are; to know what an excellent curriculum might look like in every area. But, whilst it may always be necessary to defer to the expertise of others – inside and outside your school – it pays to get into the detail, to begin to learn about each area and develop some reference points.
Here are some questions you might want to ask yourself about your school curriculum:
Maths: How is the maths curriculum organised and sequenced? Is it logical, appropriately designed for mastery, with vertical coherence? Where does it start? Number? Sequences? Why does it start there? Who decided? Do they know why?
English: Which books will students read in each year group? If you don’t know, this is a good place to start. What’s the rationale for each one and the overall sequence? Which books have been left out… who decided? Do teachers choose or is it a departmental approach? Overall, what’s the range of genres, the balance of ‘classics’ vs contemporary fiction? Is the selection something you feel is bold, interesting, building a secure foundation in the literary canon for future reading, opening doors to the world of literature, challenging and demanding as well as inspiring and engaging?
History: Does your history curriculum provide your students with the knowledge and understanding you’d hope for, given who they are and where you live – including those who select it for GCSE and those that don’t? What does your curriculum say about your priorities – is it balanced well between UK history and world history, a range of historical periods, a range of types of events – power and politics, social history, wars? Does it allow for alternative perspectives and some depth studies alongside a broad factual overview of events and key figures – the facts that every child should know?
The same questions for English and History apply to art and music and RS.  Which artists do they meet? When do students learn about Islam? What do they learn about Islam?  Which style of music curriculum do we offer: composition-focused with a contemporary slant or more classical with a strong strand of theory, notation and music history?
In science, are you confident that the curriculum is designed bottom-up, with key concepts – particles, energy, cells – embedded and built upon? What’s the general experience in relation to practical work and hands-on learning? Will students grow those plants or just label diagrams of them in theory? Will they ever design an experiment of their own? Will the geography and science curriculum links be strong, so that your students definitely all gain a very strong foundation of knowledge about climate change and sustainability? What exactly will they learn? And when?
Across the curriculum, where do students get to develop their oracy skills, to extend their knowledge of their local community, to mentally travel the world, to physically get into the countryside or the city, to see museums, to gain cultural experiences, to hear the Holocaust story, to discuss homophobia and racism, to learn about sex and relationships, to make things and be creative, to engage in an extended learning project of some form, to make a choice about the way they communicate their ideas?
Across the curriculum are you satisfied that it is challenging enough?  Challenge is a vague notion until you tie it down to some specific curriculum choices.  What’s the diet like in Year 5 and Year 8?  Are your highest attainers truly stretched? Is there any padding, filler, soft, weak meandering when a bit more rigour might be more appropriate?  Where you’ve had to make compromises because of the constraints of time and resources – are you happy you’ve arrived at the best possible balance between competing choices?
All of these questions can be answered. And then you have to decide what you think. Is it a good curriculum? Are there different, better choices you could make? Is it a curriculum you feel proud of – that represents the school you want to run and be part of? Because this is what your school actually is. Your curriculum is your school. So, to the greatest extent possible, it pays to own it, to shape it and to celebrate it. It’s so powerful if you can be on the front foot before anyone else comes along to test it out. Here’s our curriculum: This is what we are. This is what we do. And we’re proud of it.
chwindolf · 6 years
Text
On statistical learning in graphical models, and how to make your own [this is a work in progress...]
I’ve been doing a lot of work on graphical models, and at this point if I don’t write it down I’m gonna blow. I just recently came up with an idea for a model that might work for natural language processing, so I thought that for this post I’ll go through the development of that model, and see if it works. Part of the reason why I’m writing this is that the model isn’t working right now (it just shrieks “@@@@@@@@@@”). That means I have to do some math, and if I’m going to do that why not write about it.
So, the goal will be to develop what the probabilistic people in natural language processing call a “language model” – don’t ask me why. This just means a probability distribution on strings (i.e. lists of letters and punctuation and numbers and so on). It’s what you have in your head that makes you like “hi how are you” more than “1$…aaaXD”. Actually I like the second one, but the point is we’re trying to make a machine that doesn’t. Anyway, once you have a language model, you can do lots of things, which hopefully we’ll get to if this particular model works. (I don’t know if it will.)
The idea here is that if we go through the theory and then apply it to develop a new model, then maybe You Can Too TM. In a move that is pretty typical for me, I misjudged the scope of the article that I was going for, and I ended up laying out a lot of theory around graphical models and Boltzmann machines. It’s a lot, so feel free to skip things. The actual new model is developed at the end of the post using the theory.
Graphical models and Gibbs measures
If we are going to develop a language model, then we are going to have to build a probability distribution over, say, strings of length 200. Say that we only allowed lowercase letters and maybe some punctuation in our strings, so that our whole alphabet has size around 30. Then already our state space has size $30^{200}$. In general, this huge state space will be very hard to explore, and ordinary strategies for sampling from or studying our probability distribution (like rejection or importance sampling) will not work at all. But, what if we knew something nice about the structure of the interdependence of our 200 variables? For instance, a reasonable assumption for a language model to make is that letters which are very far away from each other have very little correlation – $n$-gram models use this assumption and let that distance be $n$.
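(As a tiny aside, here is the $n=2$ case in a few lines of Python, on a made-up corpus -- a bigram model just estimates $p(\text{next character}\mid\text{current character})$ from pair counts:)

```python
from collections import Counter, defaultdict

corpus = "hi how are you hi how is it going"  # toy data
pair_counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    pair_counts[a][b] += 1  # count each adjacent character pair

def p_next(char):
    """Estimated conditional distribution of the next character."""
    total = sum(pair_counts[char].values())
    return {c: k / total for c, k in pair_counts[char].items()}

print(p_next("h"))  # {'i': 0.5, 'o': 0.5} on this corpus
```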
It would be nice to find a way to express that sort of structure in such a way that it could be exploited. Graphical models formalize this in an abstract setting by using a graph to encode the interdependence of the variables, and it does so in such a way that statistical ideas like marginal distributions and independence can be expressed neatly in the language of graphs.
For an example, let $V=\{1,\dots,n\}$ index some random variables $X_V=(X_1,\dots,X_n)$ with joint probability mass function $p(x_V)$. What would it mean for these variables to respect a graph like, say if $n=5$,
[Figure: a graph on five nodes, in which node 4 separates node 5 from nodes 1, 2, 3]
In other words, we’d like to say something about the pmf $p$ that would guarantee that observing $X_5$ doesn’t tell us anything interesting about $X_1$ if we already know $X_4$. Well,
Def 0. (Gibbs random field, pmf/pdf version) Let $\mathcal{C}$ be the set of cliques in a graph $G=(V,E)$. Then we say that a collection of random variables $X_V$ with joint probability mass function or probability density function $p$ is a Gibbs random field with respect to $G$ if $p$ is of the form $$p(x_V) = \frac{1}{Z_\beta} e^{-\beta H(x_V)},$$ given that the energy $H$ factorizes into clique potentials: $$H(x_V)=\sum_{C\in\mathcal{C}} V_C(x_C).$$
Here, if $A\subset V$, we use the notation $x_A$ to indicate the vector $x_A=(x_i;i\in A)$.
For completeness, let’s record the measure-theoretic definition. We won’t be using it here so feel free to skip it, but it can be helpful to have it when your graphical model mixes discrete variables with continuous ones.
Def 1. (Gibbs random field) For the same $G$, we say that a collection of random variables $X_V\thicksim\mu$ is a Gibbs random field with respect to $G$ if $\mu$ can be written $$\mu(dx_V)=\frac{\mu_0(dx_V)}{Z_\beta} e^{-\beta H(x_V)},$$ for some energy $H$ that factorizes over the cliques like above.
Here, $\mu_0$ is the “base” or “prior” measure that appears in the constrained maximum-entropy derivation of the Gibbs measure (notice, this is the same as in exponential families -- it’s the measure that $X_V$ would obey if there were no constraint on the energy statistic), but we won’t care about it here, since it’ll just be Lebesgue measure or counting measure on whatever our state space is. Also, $Z$ is just a normalizing constant, and $\beta$ is the “inverse temperature,” which acts like an inverse variance parameter. See footnote:Klenke 538 for more info and a large deviations derivation.
There are two main reasons to care about Gibbs random fields. First, the measures that they obey (Gibbs or Boltzmann distributions) show up in statistical physics: under reasonable assumptions, physical systems that have a notion of potential energy will have the statistics of this distribution. For more details and a plain-language derivation, I like Terry Tao’s post footnote:https://terrytao.wordpress.com/2007/08/20/math-doesnt-suck-and-the-chayes-mckellar-winn-theorem/.
Second, and more to the point here, they have a lot of nice properties from a statistical point of view. For one, they have the following nice conditional independence property:
Def 2a. (Global Markov property) We say that $X_V$ has the global Markov property on $G$ if for any $A,B,S\subset V$ so that $S$ separates $A$ and $B$ (i.e., for any path from $A$ to $B$ in $G$, the path must pass through $S$), $X_A\perp X_B\mid X_S$, or in other words, $X_A$ is conditionally independent of $X_B$ given $X_S$.
Using the graphical example from above, for instance, we see what happens if we can condition on $X_4=x_4$:
[Figure: the same graph, now conditioned on $X_4=x_4$]
This is what happens in Def 2a. when $S=\{4\}$. Since the node 4 separates the graph into partitions $A=\{1,2,3\}$ and $B=\{5\}$, we can say that $X_1,X_2,X_3$ are independent of $X_5$ given $X_4$, or in symbols $X_1,X_2,X_3\perp X_5\mid X_4$.
The global Markov property directly implies a local property:
Def 2b. (Local Markov property) We say that $X_V$ has the local Markov property on $G$ if for any node $i\in V$, $X_i$ is conditionally independent from the rest of the graph given its neighbors.
In a picture,
[Figure: node 2 with its neighbors 1 and 4 highlighted]
Here, we are expressing that $X_2\perp X_3,X_5\mid X_1,X_4$, or in words, $X_2$ is conditionally independent from $X_3$ and $X_5$ given its neighbors $X_1$ and $X_4$. I hope that these figures (footnote:tikz-cd) have given a feel for how easy it is to understand the dependence structure of random variables that respect a simple graph.
It’s not so hard to check that a Gibbs measure satisfies these properties on its graph. If $X_V\thicksim p$ where $p$ is a Gibbs density/pmf, then $p$ factorizes as follows: $$p(x_V)=\prod_{C\in\mathcal{C}} p_C(x_C).$$ This factorization means that $X_V$ is a “Markov random field” in addition to being a Gibbs random field, and the Markov properties follow directly from this factorization footnote:mrfwiki. The details of the equivalence between Markov random fields and Gibbs random fields are given in the Hammersley-Clifford theorem footnote:HCThm.
This means that in situations where $V$ is large and $X_V$ is hard to sample directly, there is still a nice way to sample from the distribution, and this method becomes easier when $G$ is sparse.
The Gibbs sampler
The Gibbs sampler is a simpler alternative to methods like Metropolis-Hastings, first named in an amazing paper by Stu Geman footnote:thatpaper. It’s best explained algorithmically. So, say that you had a pair of variables $X,Y$ whose joint distribution $p(x,y)$ is unknown or hard to sample from, but where the conditional distributions $p(x\mid y)$ and $p(y\mid x)$ are easy to sample. (This might sound contrived, but the language model we’re working towards absolutely fits into this category, so we will be using this a lot.) How can we sample faithfully from $p(x,y)$ using these conditional distributions?
Well, consider the following scheme. For the first step, given some initialization $x_0$ for $X$, sample $y_0\thicksim p(\cdot\mid x_0)$. Then at the $i$th step, sample $x_{i}\thicksim p(\cdot\mid y_{i-1})$, and then sample $y_{i}\thicksim p(\cdot \mid x_i)$. The claim is that the Markov chain $(X_i,Y_i)$ will approximate samples from $p(x,y)$ as $i$ grows.
It is easy to see that $p$ is an invariant distribution for this Markov chain: indeed, if it were the case that $x_0,y_0$ were samples from $p(X,Y)$, then if $X_1\thicksim p(X\mid Y=y_0)$, clearly $X_1,Y_0\thicksim p(X\mid Y)p(Y)$, since $Y_0$ on its own must be distributed according to its marginal $p(Y)$. By the same logic, $X_1$ is distributed according to its marginal $p(X)$, so that if $Y_1$ is chosen according to $p(Y\mid X_1)$, then the pair $X_1,Y_1\thicksim p(Y\mid X)p(X)=p(X,Y)$.
The proof that this invariant distribution is the limiting distribution of the Markov chain is more involved and can be found in the Geman paper, but to me this is the main intuition.
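Here's a minimal sketch of the scheme in Python, on a target distribution of my own choosing (a bivariate normal with correlation $\rho$, whose conditionals $X\mid Y=y\thicksim N(\rho y,\,1-\rho^2)$ are easy to sample):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(rho, n_steps):
    """Alternate X | Y and Y | X draws; the chain's limiting distribution
    is the standard bivariate normal with correlation rho."""
    x, y = 0.0, 0.0
    samples = []
    sd = np.sqrt(1 - rho**2)  # conditional standard deviation
    for _ in range(n_steps):
        x = rng.normal(rho * y, sd)  # x_i ~ p(. | y_{i-1})
        y = rng.normal(rho * x, sd)  # y_i ~ p(. | x_i)
        samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal(rho=0.9, n_steps=10_000)
print(np.corrcoef(s[1000:].T))  # empirical correlation ~ 0.9 after burn-in
```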
This sampler is especially useful in the context of graphical models, and more so when the graph has some nice regularity. The Gemans make great use of this to create a parallel algorithm for sampling from their model, which would otherwise be very hard to study. For a simpler example of the utility, consider a lattice model (you might think of an Ising model), i.e. say that $G$ is a piece of a lattice: take $V=\{1,\dots,n\}\times\{1,\dots,n\}$ and let $E=\{((h,i),(j,k)):\lvert h-j\rvert + \lvert i-k\rvert=1\}$ be the nearest-neighbor edge system on $V$. Say that $X_V$ form a Gibbs random field with respect to that lattice (here we let $n=4$):
[Figure: a $4\times 4$ lattice graph with nearest-neighbor edges]
Even the relatively simple Gibbs distribution of the free Ising model, $$p(x_V)=\frac{1}{Z_\beta}\exp\left( \beta\, {\textstyle\sum_{u,v\in E} x_u x_v }\right)$$ is hard to sample. But in general, any Gibbs random field on this graph can be easily sampled with the Gibbs sampler: if $N_u$ denotes the neighbors of the node $u=(i,j)$, then to sample from $p$, we can proceed as follows: let $x_V^0$ be a random initialization. Then, looping through each site $u\in V$, sample the variable $X_u\thicksim p(X_u\mid X_{N_u})$, making sure to use the new samples where available and the initialization elsewhere. Then let $x_V^1$ be the result of this first round of sampling, and repeat. This algorithm is “local” in the sense that each node only needs to care about its neighbors, so the loop can be parallelized simply (as long as you take care not to sample two neighboring nodes at the same time).
Later, we will be dealing with bipartite graphs with lots of nice structure that make this sampling even easier from a computational point of view.
The Ising model
One important domain where graphical models have been used is to study what are known as “spin glasses.” Don’t ask me why they are called glasses, I don’t know, maybe it’s a metaphor. The word “spin” shows up because these are models of the emergence of magnetism, and magnetism is what happens when all of the little particles in a chunk of material align their spins. Spin is probably also a metaphor.
In real magnets, these spins can be oriented in any direction in God’s green three-dimensional space. But to simplify things, some physicists study the case in which each particle’s spin can only point “up” or “down” -- this is already hard to deal with. The spins are then represented by Bernoulli random variables, and in the simplest model (the Ising model footnote:ising), a joint pmf is chosen that respects the “lattice” or “grid” graph in arbitrary dimension. The grid is meant to mimic a crystalline structure, and keeps things simple. Since each site depends only on its neighbors in the graph, Gibbs sampling can be done quickly and in parallel to study this system, which turns out to yield some interesting physical results about phase changes in magnets.
Now, this is an article about applications of graphical models to natural language processing, so I would not blame you for feeling like we are whacking around in the weeds. But this is not true. We are not whacking around in the weeds. The Ising model is the opposite of the weeds. It is a natural language processing algorithm, and that’s not a metaphor, or at least it won’t be when we get done with it. Yes indeed. We’re going to go into more detail about the Ising model. That’s right. I am going to give you the particulars.
Let’s take a look at the Ising pmf, working in two dimensions for a nice mix of simplicity and interesting behavior. First, let’s name our spin: Let $z_V$ be a matrix of Bernoulli random variables $z_{i,j}\in\{0,1\}$. Here, we’re letting $z_{i,j}=0$ represent a spin pointing down at position $(i,j)$ in the lattice -- a 1 means spin up, and $V$ indicates the nodes in our grid.
Now, how should spins interact? Well, by the local Markov property, we know that $$p(z_{i,j}\mid z_V)=p(z_{i,j}\mid z_{N_{i,j}}),$$ where $$N_{i,j}=\{(i',j')\mid \lvert i'-i\rvert + \lvert j' - j\rvert =1 \}$$ is the set of neighbors of the node at position $(i,j)$. In other words, this node’s interaction with the material as a whole is mediated through its direct neighbors in the lattice. And, since this is a magnet, the idea is to choose this conditional distribution such that $z_{i,j}$ is likely to align its spin with those of its neighbors. Indeed, following our Gibbs-measure style, let $$p(z_{i,j}\mid z_{N_{i,j}})=\frac{1}{Z_{i,j}}\exp(-\beta H_{i,j}).$$ Here $H_{i,j}$ is the “local” magnetic potential at our site $(i,j)$, $$H_{i,j}=-z_{i,j}\sum_{\alpha\in N_{i,j}} z_\alpha,$$ and $Z_{i,j}$ is the normalizing constant for this conditional distribution.
So, does this conditional distribution have the property that $z_{i,j}$ will try to synchronize with its neighbors? Well, let’s compute the conditional probabilities that $z_{i,j}$ is up or down as a function of its neighbors. \begin{align*} p(z_{i,j}=1\mid z_{N_{i,j}}) &= \frac{1}{Z_{i,j}} \exp\left({\textstyle \beta\sum_{\alpha\in N_{i,j}} z_\alpha}\right)\\\\ p(z_{i,j}=0\mid z_{N_{i,j}}) &= \frac{1}{Z_{i,j}}e^0=\frac{1}{Z_{i,j}}. \end{align*} Then we have $Z_{i,j}=1 + \exp\left({\textstyle \beta\sum_{\alpha\in N_{i,j}} z_\alpha}\right)$. In other words, $$p(z_{i,j}=1\mid z_{N_{i,j}})=\sigma\left({\textstyle \beta\sum_{\alpha\in N_{i,j}} z_\alpha}\right),$$ where $\sigma$ is the sigmoid function $\sigma(x)=e^{x}/(e^{x}+1)$ that often appears in Bernoulli pmfs. This function is increasing, which indicates that local synchrony is encouraged. I should also note that Ising is usually done with $\text{down}=-1$ instead of 0, which is even better for local synchrony.
I thought that giving the local conditional distributions would be a nice place to start for getting some intuition on the model -- in particular, these tell you how Gibbs sampling would evolve. Since this distribution can be sampled by a long Gibbs sampling run, we have sort of developed an intuition that the samples from an Ising model should have some nice synchrony properties with high probability.
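To make that concrete, here's a small Gibbs-sampling sketch for this model (my choices: $\{0,1\}$ spins as above, free boundary conditions, and a plain sequential sweep):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(z, beta):
    """One in-place Gibbs sweep over an n x n grid of {0,1} spins,
    using p(z_ij = 1 | neighbors) = sigma(beta * sum of neighbors)."""
    n = z.shape[0]
    for i in range(n):
        for j in range(n):
            s = 0  # sum over the (up to 4) lattice neighbors
            if i > 0:
                s += z[i - 1, j]
            if i < n - 1:
                s += z[i + 1, j]
            if j > 0:
                s += z[i, j - 1]
            if j < n - 1:
                s += z[i, j + 1]
            p_up = 1.0 / (1.0 + np.exp(-beta * s))
            z[i, j] = rng.random() < p_up
    return z

z = rng.integers(0, 2, size=(16, 16))
for _ in range(200):  # a long run approximates the Gibbs distribution
    gibbs_sweep(z, beta=1.0)
print(z.mean())  # with {0,1} spins the field is never negative, so mostly 1s
```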
But, can we find a joint pmf for $z_V$ that yields these conditional distributions? Indeed, if we define the energy $$H(z_V)=-\sum_{\alpha\in V}\sum_{\beta\in N_\alpha} z_\alpha z_\beta,$$ and from here the pmf $$p(z_V)=\frac{1}{Z}e^{-\beta H(z_V)},$$ then we can quickly check that this is a Gibbs random field on the lattice graph with conditional pmfs as above.
For something with such a simple description, the Ising model has a really amazing complexity. Some of the main results involve qualitative differences in the model’s behavior for different values of the inverse temperature $\beta$. For low $\beta$ (i.e., high temperature, high variance), the model is uniformly random. For large $\beta$, the model concentrates on its modes (i.e., the minima of $H$). It turns out that there is a critical temperature somewhere in the middle, with a phase change from a disordered material to one with magnetic order. There are also theorems about annealing, one of which can be found in a slightly more general setting in footnote:geman.
The complexity of these models indicates that they are good at handling information, especially when some form of synchrony is useful as an inductive bias. So, we’ll now start to tweak the model so that it can learn to model something other than a chunk of magnetic material.
Statistical learning in Gibbs random fields
Inspired by this and some related work in machine learning footnote:history, Ackley, Hinton, and Sejnowski footnote:ackley generalized the Ising model to one that can fit its pmf to match a dataset, which they called a Boltzmann machine. It turns out that their results can be extended to work for general Gibbs measures, so I will present the important results in that context, but we will stick to Ising/Boltzmann machines in the development of the theory, and as an important special case. Then the language model we’d like to develop will appear as a sort of “flavor” of Boltzmann machine, and we’ll already have the theory in place regarding how it should learn and be sampled and all that stuff.
So, here is my sort of imagined derivation of the Boltzmann machine from the Ising model. The units in the Ising model are hooked together like the atoms in a crystal. How can they be hooked together like the neurons in a brain? Well, first, let’s enumerate the sites in the Ising model’s lattice, so that the vertices are now named $V=\{1,\dots,n\}$ for some $n$, instead of being 2-d indices. Then the lattice graph has some $n\times n$ adjacency matrix $W$, where $W_{\alpha\beta}=1$ if and only if $\alpha$ and $\beta$ represent neighbors in the original lattice. So, there should be exactly four 1s along each row (fewer at the boundary), etc.
Let’s rewrite the Ising model’s distribution in terms of $W$. Indeed, we can see that $$H(z_V)=-z_V W z_V^T,$$ so that the energy is just (minus) the quadratic form on $\mathbb{R}^n$ induced by $W$. Then if we fix the notation $$x_\alpha = [W z_V^T]_\alpha,$$ we can rewrite $H_\alpha=-z_\alpha x_\alpha$, and simplify our conditional distribution to $$p(z_\alpha=1\mid z_{V})=\sigma(\beta x_\alpha).$$ Not bad.
Now, this is only an Ising model as long as $W$ remains fixed. But what if we consider $W$ as a parameter to our distribution $p(z_V; W)$, and try to fit $W$ to match some data? For example, say that we want to learn the MNIST dataset of handwritten digits, which is a dataset of images with 784 pixels. Then we’d like to learn some $W$ with $V=\{1,\dots,784\}$ such that samples from the distribution look like handwritten digits.
This can be done, but it should be noted that sometimes it helps to add some extra “hidden” nodes $z_H$ to our collection of “visible” nodes $z_V$, for $H=\{785,\dots,784 + n_H\}$, so that the random vector grows to $z=(z_1,\dots,z_{784},z_{785},\dots,z_{784+n_H})$. Let’s also grow $W$ to be an $n\times n$ matrix for $n=n_V+n_H$, where in this example the number of visible units is $n_V=784$. Then the Hamiltonian becomes $H(z)=-z W z^T$ just like above, and similarly for $x_\alpha$ and so on.
Adding more units gives the pmf more degrees of freedom, and now the idea is to have the marginal pmf for the visible units $p(z_V;W)$ fit the dataset, and let the “hidden” units $z_H$ do whatever they have to do to make that happen.
Ackley, Hinton, and Sejnowski came up with a learning algorithm, with the very Hintony name “contrastive divergence,” which was presented and refined in footnote:cdpapers. I’d like to first present and prove the result in the most general setting, and then we’ll discuss what that means for training a Boltzmann machine.
The most general version of contrastive divergence that I think it’s useful to consider is for $z$ (now not necessarily a binary vector) to be Gibbs distributed according to some parameters $\theta$, with distribution $$p(z;\theta)\,dz=\frac{1}{Z}\exp(-\beta H(z;\theta))\,dz.$$ Here, since we are trying to be general, we’re letting $z$ be absolutely continuous against some base measure $dz$, and $p$ is the density there. But it should help to keep in mind the special case where $p$ is just a pmf, like in the Boltzmann machine, where $z$ is a binary vector $z\in\{0,1\}^{n}$.
First, we need a quick result about the form taken by marginal distributions in Gibbs random fields, since we want to talk about the marginal over the visible units.
Def 3. (Free energy) Let $A$ be a set of vertices in the graph. Then the “free energy” $F_A(z_A)$ of the units $z_A$ is given by$$F_A(z_A)=-\log\int_{z_{\setminus A}} e^{-\beta H(z)}\,dz_{\setminus A}.$$
If it seems weird to you to be taking an integral in the case of binary units, you can just think of that integral $\int\cdot \,dz_{\setminus A}$ as a sum $\sum_{z_{\setminus A}}$ over all possible values of the remaining portion of the vector -- this is because in the Boltzmann case, the base measure $dz$ is just counting measure.
It follows directly from this definition that:
Lemma 4. (Marginals in Gibbs distributions) For $A$ as before, if $Z_A$ is a normalizing constant, we have $$p(z_A)=\frac{1}{Z_A} e^{-F_A(z_A)}.$$
So, the free energy takes the role of the Hamiltonian/energy $H$ when we talk about marginals. We can quickly check Lemma 4 by plugging in: \begin{align*} Z_A p(z_A) &= e^{-F_A(z_A)}\\\\ &= \exp\left(    \log\left(        \int_{z_{\setminus A}} e^{-\beta H(z_A,z_{\setminus A})} \,dz_{\setminus A}    \right) \right)\\\\ &= \int_{z_{\setminus A}} e^{-\beta H(z_A,z_{\setminus A})} \,dz_{\setminus A}, \end{align*} which agrees with the standard definition of the marginal. Maybe it should also be noted that $F_V$ is implicitly a function of $\beta$ -- we are marginalizing the distribution $p_\beta(z)$, and my notation has let $\beta$ get a little lost, since it’s not so important right now.
Contrastive divergence will work with the free energy for the visible units $F_V$. And guess what, I think we’re ready to state it. It’s not really a theorem, but more of a bag of practical results that combine to make a learning algorithm, some of which deserve proof and others of which are well-known.
“Theorem”/Algorithm 5. (Contrastive divergence) Say that $\mathcal{D}$ is the distribution of the dataset, and consider the maximum-likelihood problem of finding $$\theta^*={\arg\max}_{\theta} \mathbb{E}_{z_V^+\thicksim \mathcal{D}}[\log p(z_V^+;\theta)].$$In other words, we’d like to find $\theta$ that maximizes the model’s expected log likelihood of the visible units, assuming that the visible units are distributed according to the data. We note that this is the same as finding the minimizer of the Kullback-Leibler divergence:$$D_{KL}(\mathcal{D}\,\|\,p(z_V;\theta))=\mathbb{E}^+[\log\mathcal{D}(z_V^+)] - \mathbb{E}^+[\log p(z_V^+;\theta)]$$since $\mathcal{D}$ does not depend on $\theta$. (In general, we’ll use superscript $+$ to indicate that a quantity is distributed according to the data, and $-$ to indicate samples from the model, and similarly $\mathbb{E}^{\pm}$ to indicate expectations taken against the data distribution or the model’s distribution.)
To find $\theta^*$, we should not hesitate to pull gradient ascent out of our toolbox. And luckily, we can write down a formula for the gradient:
Main Result 5a. For the gradient ascent, we claim that $$\frac{\partial}{\partial\theta} \mathbb{E}_{z_V^+\thicksim \mathcal{D}}[\log p(z_V^+;\theta)] = -\mathbb{E}_{z_V^+\thicksim \mathcal{D}}\left[\tfrac{\partial F_V(z_V^+)}{\partial\theta}\right] + \mathbb{E}_{z_V^-\thicksim p(z_V;\theta)}\left[ \tfrac{\partial F_V(z_V^-)}{\partial\theta}\right].$$
All that remains is to estimate these expectations. I say “estimate” since we probably can’t compute expectations against $\mathcal{D}$ (we might not even know what $\mathcal{D}$ is), and since we can’t even sample directly from $p(z_V)$. In practice, one uses a single sample $z_V^+\thicksim\mathcal{D}$ to estimate the first expectation Monte-Carlo style, and then uses a Gibbs sampling Markov chain starting from $z_V^+$ to try to get a sample $z_V^-$ from the model’s distribution, which is then used to estimate the second expectation. So, each step of gradient ascent requires a little bit of MCMC.
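To make that loop concrete, here is a minimal sketch of one CD-$k$ parameter update in Python. The helpers `free_energy_grad` and `gibbs_step` are hypothetical stand-ins for the model-specific pieces (nothing by these names has been defined above):

```python
import numpy as np

def cd_step(theta, z_V_plus, free_energy_grad, gibbs_step, lr=0.01, k=1):
    """One contrastive-divergence (CD-k) update, following Main Result 5a.

    theta            -- current parameters, a numpy array
    z_V_plus         -- one visible sample drawn from the dataset
    free_energy_grad -- hypothetical callable: (theta, z_V) -> dF_V/dtheta
    gibbs_step       -- hypothetical callable: (theta, z_V) -> z_V after one
                        sweep of Gibbs sampling (visible -> hidden -> visible)
    """
    # Negative sample: run the Markov chain for k steps, starting at the data.
    z_V_minus = z_V_plus
    for _ in range(k):
        z_V_minus = gibbs_step(theta, z_V_minus)

    # Monte Carlo estimate of the gradient of the expected log likelihood:
    #   -dF/dtheta at the data sample  +  dF/dtheta at the model sample.
    grad = -free_energy_grad(theta, z_V_plus) + free_energy_grad(theta, z_V_minus)

    # Gradient *ascent*, since we are maximizing the log likelihood.
    return theta + lr * grad
```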
What requires proof is the main result in the middle there. See footnote:ackley for the proof in the special case of a Boltzmann machine, but in general the proof is not hard. Below, for brevity we fix the notation $\mathbb{E}^+$ for expectations when $z$ comes from the data, and $\mathbb{E}^-$ when $z$ comes from the model.
Proof 5a. We just have to compute the derivative of the expected log likelihood. First, let’s expand the expression that we are differentiating:\begin{align*}\mathbb{E}^+[\log p(z_V^+;\theta)]&=\mathbb{E}^+[\log\exp(-F_V(z_V^+)) - \log Z_V]\\\\ &=-\mathbb{E}^+[F_V(z_V^+)] - \log Z_V.\end{align*}Here, we have removed the constant $\log Z_V$ from the expectation. Already we can see that differentiating the first term gives the right result. So, it remains to show that $\frac{\partial}{\partial\theta}\log Z_V=-\mathbb{E}^-[\frac{\partial}{\partial\theta} F_V(z_V^-)]$. Indeed,\begin{align*} \frac{\partial}{\partial\theta}\log Z_V &=\frac{1}{Z_V}\frac{\partial}{\partial\theta}Z_V\\\\ &=\frac{1}{Z_V}\int_{z_V} \frac{\partial}{\partial\theta} e^{-F_V(z_V;\theta)}\,dz_V\\\\ &=\int_{z_V}\frac{1}{Z_V} e^{-F_V(z_V;\theta)}\,\frac{\partial}{\partial\theta}\left(-F_V(z_V;\theta)\right)dz_V\\\\ &=-\int_{z_V} p(z_V;\theta) \frac{\partial}{\partial\theta} F_V(z_V;\theta)\,dz_V\\\\ &=-\mathbb{E}^-\left[ \tfrac{\partial F_V(z_V;\theta)}{\partial\theta} \right], \end{align*} as desired.
The derivative of the free energy will have to be computed on a case-by-case basis, and in some cases the integrals involved may not have closed forms. Now, let’s take a look at how this works for a typical Boltzmann machine.
Contrastive divergence for restricted Boltzmann machines
Recall that the Boltzmann machine is a Gibbs random field with energy$$H(z)=-\frac{1}{2} z W z^T.$$I haven’t said much about these yet, so let’s clear some things up. First off, we take $W$ to be a symmetric matrix -- only the symmetric part of $W$ contributes to the quadratic form anyway, and symmetry is what makes the connection between each pair of units a single undirected weight. Second, it’s typical to fix the diagonal elements $W_{ii}=0$, although this can be dropped if necessary.
This is the most general flavor of Bernoulli Boltzmann machine, with each node connected to all of the others -- we’ll call it a “fully-connected” Boltzmann machine. These tend not to be used in practice. One reason is that Gibbs sampling cannot be done in parallel: since the model respects only the fully connected graph, every unit depends on every other, so the units have to be updated one at a time, which makes everything really slow.
It would be more practical to use a graph with sparser dependencies. The most popular flavor is to pick a bipartite graph, where the partitions are the visible and hidden nodes. In other words, any edge in the graph connects a visible unit with a hidden one. This is called a “restricted” Boltzmann machine, and in this case $W$ has the special block form $$W=\left[\begin{array}{c|c} 0 & M \\\\ \hline M^T & 0 \end{array} \right].$$Here, we’ve enforced the bipartite and symmetric structure with notation, and we let $M$ be an $n_V\times n_H$ matrix. Then our energy can be written\begin{align*}H(z)&=-\frac{1}{2} z W z^T\\\\&= -\frac{1}{2} z_V M z_H^T - \frac{1}{2} z_H M^T z_V^T\\\\&=-z_V M z_H^T.\end{align*}Also, we can rewrite the “inputs” to units $x_i=[M z_H^T]_i$ for $i\in V$ and $x_i=[M^T z_V^T]_i$ for $i\in H$ -- this will be useful notation to have later.
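As a quick sanity check on this notation, here is the energy and the inputs in numpy; the sizes and random weights are placeholders, not anything from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_V, n_H = 4, 3
M = rng.normal(size=(n_V, n_H))      # visible-to-hidden weights

z_V = rng.integers(0, 2, size=n_V)   # binary visible units
z_H = rng.integers(0, 2, size=n_H)   # binary hidden units

H = -z_V @ M @ z_H                   # energy H(z) = -z_V M z_H^T

x_V = M @ z_H                        # inputs x_i for i in V
x_H = M.T @ z_V                      # inputs x_j for j in H
```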
OK. Now, we’d like to compute the expression in Main Result 5a so that we can do gradient ascent and train our RBM. Let’s begin by calculating the free energy of the visible units. Starting from the definition,\begin{align*} F_V(z_V) &=-\log \sum_{z_H} e^{-H(z)}\\\\ &=-\log \sum_{z_H} \exp\bigg( z_V M z_H^T \bigg)\\\\ &=-\log \sum_{z_H} \exp\bigg( \sum_{i=1}^{n_V} \sum_{j=1}^{n_H} z^V_i M_{ij} z^H_j\bigg)\\\\ &=-\log\left[ \sum_{z^H_1=0}^1\dotsm\sum_{z^H_{n_H}=0}^1 \left( \prod_{j=1}^{n_H} \exp\!\bigg( \sum_{i=1}^{n_V}  z^V_i M_{ij} z^H_j\bigg)\right)\right]. \end{align*}Here, pause to notice that we have $n_H$ sums on the outside, and that the $j$th factor of the product only cares about the one hidden unit $z^H_j$. So the sum of products factorizes into a product of sums, which we can then pull through the logarithm:\begin{align*} F_V(z_V) &=-\log\left[ \prod_{j=1}^{n_H} \sum_{z^H_j=0}^1 \exp\!\bigg( \sum_{i=1}^{n_V}  z^V_i M_{ij} z^H_j\bigg)\right]\\\\ &=-\sum_{j=1}^{n_H} \log \left( 1 + \exp\!\bigg( \sum_{i=1}^{n_V}  M_{ij} z^V_i \bigg)\right)\\\\ &=-\sum_{j=1}^{n_H} \log \left( 1 + e^{x_j}\right). \end{align*} Not so bad, after all. In particular, we’ll be comfortable differentiating this with respect to the weights $M_{ij}$. Which, well, that’s what we’re doing next. These calculations might be sort of boring, and you know, the details are not that important, but we’re gonna be doing something similar later to train the new model, so it seems nice to see the way it goes in the classic model first.
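If you don’t trust the factorization, here is a small brute-force check -- it compares the closed form against the literal sum over all $2^{n_H}$ hidden configurations (tiny placeholder sizes, so the enumeration stays cheap):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_V, n_H = 3, 4
M = rng.normal(size=(n_V, n_H))
z_V = rng.integers(0, 2, size=n_V)

# Closed form: F_V(z_V) = -sum_j log(1 + exp(x_j)), where x = M^T z_V.
x = M.T @ z_V
F_closed = -np.sum(np.log1p(np.exp(x)))

# Brute force: F_V(z_V) = -log sum_{z_H} exp(z_V M z_H^T).
total = sum(np.exp(z_V @ M @ np.array(z_H))
            for z_H in itertools.product([0, 1], repeat=n_H))
F_brute = -np.log(total)

assert np.isclose(F_closed, F_brute)
```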
Anyway, now to take the derivative. We’ll write $\partial_{M_{ij}}=\frac{\partial}{\partial M_{ij}}$ for short, and we’ll work from the second to last line in the previous display (using $l$ as the dummy index, to keep it clear of the fixed $i$). \begin{align*} \partial_{M_{ij}} F_V(z_V) &= - \partial_{M_{ij}} \sum_{k=1}^{n_H} \log \left( 1 + \exp\!\bigg( \sum_{l=1}^{n_V}  M_{lk} z^V_l \bigg)\right)\\\\ &= - \partial_{M_{ij}} \log \left( 1 + \exp\!\bigg( \sum_{l=1}^{n_V}  M_{lj} z^V_l \bigg)\right) \\\\ &= - \frac{1}{1 + \exp\!\left( \sum_{l=1}^{n_V}  M_{lj} z^V_l \right)}     \partial_{M_{ij}} \exp\!\bigg( \sum_{l=1}^{n_V}  M_{lj} z^V_l \bigg)\\\\ &= - \frac{\exp\!\left( \sum_{l=1}^{n_V}  M_{lj} z^V_l \right)}{1 + \exp\!\left( \sum_{l=1}^{n_V}  M_{lj} z^V_l \right)}     z^V_i. \end{align*} Here, we pause to notice that the fraction in the last line is exactly $$\frac{e^{x_j}}{1+e^{x_j}}=\sigma(x_j)=p(z^H_j=1\mid z_V)=\mathbb{E}[Z^H_j\mid z_V].$$This might seem like a coincidence that just happened to pop out of the derivation, but as far as I can tell, this conditional expectation always shows up in this part of the gradient in all sorts of Boltzmann machines with similar connectivity (e.g., in footnote:honglak, footnote:zemel). I don’t know if this has been proven, but I think it’s a safe bet and a good way to check your math if you’re making your own model. Plugging this in, we find $$\partial_{M_{ij}} F_V(z_V) = -z^V_i \mathbb{E}[Z^H_j  \mid z_V].$$ Now, let’s plug this result back into our formula for the gradient to finish up. By the averaging property of conditional expectations,\begin{align*} \frac{\partial}{\partial M_{ij}} \mathbb{E}^+[\log p(z_V^+;\theta)] &= -\mathbb{E}^+[\partial_{M_{ij}} F_V(z_V^+)]    + \mathbb{E}^-[ \partial_{M_{ij}}  F_V(z_V^-)]\\\\ &= \mathbb{E}^+[ z^V_i \mathbb{E}[Z^H_j  \mid z_V] ]   - \mathbb{E}^-[ z^V_i \mathbb{E}[Z^H_j  \mid z_V] ]\\\\ &= \mathbb{E}^+[ z^V_i z^H_j ]   - \mathbb{E}^-[ z^V_i z^H_j ]. \end{align*}This update has a nice Hebbian reading: the positive phase strengthens correlations between units that fire together on the data, and the negative phase “unlearns” whatever correlations the model produces when it runs free.
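Here is one way the full CD-1 weight update might look in code. This is a sketch, not the canonical recipe: whether you use probabilities or binary samples in each phase is a practical choice that the math above doesn’t dictate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(M, z_V_data, rng, lr=0.1):
    """One CD-1 update of M, following E+[z_i z_j] - E-[z_i z_j]."""
    # Positive phase: hidden conditionals given the data, E[Z^H | z_V^+].
    p_H_data = sigmoid(M.T @ z_V_data)
    z_H = (rng.random(p_H_data.shape) < p_H_data).astype(float)

    # One Gibbs sweep back down and up again to get the "negative" sample.
    p_V_model = sigmoid(M @ z_H)
    z_V_model = (rng.random(p_V_model.shape) < p_V_model).astype(float)
    p_H_model = sigmoid(M.T @ z_V_model)          # E[Z^H | z_V^-]

    # Hebbian positive phase minus the model's own ("dream") correlations.
    grad = np.outer(z_V_data, p_H_data) - np.outer(z_V_model, p_H_model)
    return M + lr * grad
```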
That’s nice, but it ought to be mentioned that there are a lot of practical considerations to take into account if you want to train a really good model, since the gradients are noisy and the search space is large and unconstrained. footnote:practical is a good place to start.
To finish, it should be noted that these RBMs can be stacked into what’s called a “deep belief network” -- I think this is another Hinton name. What that means is that more hidden layers are added, so that the graph becomes $(n+1)$-partite, where $n$ is the number of hidden layers (we add 1 for the visible layer). Then the energy function becomes$$H(z_V,z_{H_1},\dots,z_{H_n})=-z_V M_1 z_{H_1}^T - \dots - z_{H_{n-1}} M_n z_{H_n}^T.$$ These networks are tough to train directly using the free-energy differentiating method above (although the math is similar). In practice, each pair of adjacent layers is trained as if it were an isolated RBM -- this is called greedy layer-wise training, and it’s described in footnote:greedylayerwise. There’s a good tutorial on DBNs here as well footnote:http://deeplearning.net/tutorial/DBN.html.
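A sketch of the greedy layer-wise recipe, with `train_rbm` a hypothetical stand-in for the single-RBM training loop sketched above:

```python
import numpy as np

def train_dbn_greedy(data, layer_sizes, train_rbm):
    """Greedy layer-wise pretraining of a DBN.

    data        -- visible samples, shape (num_samples, n_V)
    layer_sizes -- hidden layer sizes [n_H1, n_H2, ...]
    train_rbm   -- hypothetical callable: (samples, n_hidden) -> trained
                   weight matrix of shape (n_in, n_hidden)
    """
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        M = train_rbm(layer_input, n_hidden)
        weights.append(M)
        # Push each sample upward using E[Z^H | z_V] = sigmoid(M^T z_V);
        # sampling binary activations here instead is also common.
        layer_input = 1.0 / (1.0 + np.exp(-(layer_input @ M)))
    return weights
```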
One important thing to note about DBNs is that when you are doing Gibbs sampling in the middle of the network, you have to sample from the conditional distribution of your layer given both the layer above and the layer below. So, there is a little more bookkeeping involved. There are other methods for sampling these networks too (footnote:honglak, footnote:wheredoeshintontalkaboutdeepdreamthing).
Developing langmod-dbn
So, we’ll be working on a DBN to model language, since probably a single layer will not be enough to model anything very interesting. But, since we’ll be using greedy layer-wise training, we can just describe how a “langmod-rbm” looks, and that will take us most of the way there.
OK, so. We want to develop a family of probability distributions $p_n(z^{(n)}_V)$, on strings $z_V^{(n)}$ of length $n$, one probability distribution for each possible length $n$. These strings will come from an “alphabet” $\mathcal{A}$. For example, we might have $\mathcal{A}=\{a,b,c,\dots\}$, or we might allow capital letters or punctuation, etc. So, a length-$n$ string is then just some element of $\mathcal{A}^n$. For short, let’s let $N=\lvert\mathcal{A}\rvert$ be the size of the alphabet.
Our starting point is going to be the RBM as above. The first change is that we will focus on the special case where $M$ is not a general linear transform, but rather a convolution (actually cross-correlation) with some kernel. This is just a choice of the form that $M$ takes, so we are in the same arena so far. There is a good deal of work around convolutional Boltzmann machines for image processing tasks (especially footnote:honglak), but I’m not aware of any work using a convolutional Boltzmann machine for natural language.
(The thing is that we’re going to do sort of a strange convolution, so if you’re familiar with them this will be a little weird. If you’re not, well I’m going to do my best not to assume you are, but it might help to understand discrete cross-correlations a bit. If this is confusing, you might take a look at footnote:https://en.wikipedia.org/wiki/Cross-correlation, and at Andrej Karpathy’s lecture notes on convolutional neural networks footnote:http://cs231n.github.io/convolutional-networks/ for a machine learning point of view.)
Now, it does not make much sense to encode $z_V$ (we’ll use $z_V$ as shorthand for $z_V^{(n)}$) as just an element of $\mathcal{A}^n$, since $\mathcal{A}$ is not an alphabet of numbers. But we can make letters into numbers by fixing a function $\phi:\mathcal{A}\to \{1,\dots,N\}$, so that we can now associate an integer to each letter.
But the choice of $\phi$ is pretty arbitrary, and it does not make much sense to let $z_V$ just be a vector of integers like this. Rather, let’s convert strings into 2-d arrays using what’s called a one-hot encoding. The one-hot encoding of a letter $a$ is an $N$-d vector $e(a)$, where $e(a)_i=1$ if $i=\phi(a)$, and $e(a)_i=0$ if $i\neq\phi(a)$. So, we take a letter to a binary vector with a single one in the natural position.
Then we can encode a whole string $s=a_1a_2\dotsm a_n$ to its one-hot encoding naturally by letting $$e(s)=(e(a_1)\ e(a_2)\ \dotsm\ e(a_n)).$$ So, we’re imagining strings as “images” of a sort -- the width of the image is $n$, and the height is $N$. So, we’ll define an energy function on $z_V$, where $z_V$ is no longer a binary vector like above, but rather an $n\times N$ binary image with a single one in each column (one column per position of the string). This way, the hidden units can “tell” what letters are sending them input.
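In code, the encoding might look like this (with a toy alphabet, and zero-based indices for $\phi$ rather than the $\{1,\dots,N\}$ above):

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "          # toy alphabet A
phi = {a: i for i, a in enumerate(alphabet)}      # phi: A -> {0, ..., N-1}
N = len(alphabet)

def one_hot(s):
    """Encode a string as an n x N binary image, one 1 per column/position."""
    z = np.zeros((len(s), N), dtype=np.int8)
    for i, a in enumerate(s):
        z[i, phi[a]] = 1
    return z

z_V = one_hot("hello world")   # shape (11, N)
```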
Now, instead of having each hidden unit receive input from all of the visible units, as in an ordinary RBM, we’ll say that each hidden unit receives input from some $k$ consecutive columns of visible units (i.e. $k$ consecutive letters), for some choice of $k$. We’ll also assume that the input is “stationary,” just like Shannon did in his early language modelling work footnote:shannon. This means we are assuming that a priori, the distribution of each letter in the string is not influenced by its position in the string. In other words, the statistics of the string are invariant under translations.
But since the input is translation invariant, each hidden unit should expect statistically similar input. So, it makes sense to let the hidden units be connected to their $k$ input letters by the same exact weights. Here’s a graphic of “sliding” those weights across the visible field to produce the inputs $x_j$ to the hidden layer:
[Figure: animation of a width-3 kernel (the shadow) sliding along the blue grid of visible units, producing the inputs to the hidden layer.]
This figure footnote:figurepeople takes a little explaining, but what it shows is exactly how we compute the cross-correlation of the visible field and the weights. Here, the blue grid represents the visible field. The weights (we’ll be calling them the “kernel” or the “filter”, and they’re represented by the shadow on the blue grid) are sliding along the width dimension of the input, and we have chosen a kernel width $k=3$ in the figure. In general, the kernel height is arbitrary, but here we’ve chosen it to be exactly the height of the visible field, which is the alphabet size $N$. This way each row of the kernel is always lined up against the same row of the visible field, so that it is always activated by the same letter, and it can learn the patterns associated with that letter.
At this point, it might help to put down a working definition of cross-correlation, which should be thought of as something like an inner product. (You can think of $u$ as the $n\times N$ visible field, and $v$ as the $k\times N$ filter.)
Def 6. (Cross-correlation) Let $u$ be an $n_1\times n_2$ matrix, and let $v$ be an $m_1\times m_2$ matrix, such that $n_1\geq m_1$ and $n_2\geq m_2$. Then the cross-correlation $u*v$ is the $(n_1-m_1+1)\times(n_2-m_2+1)$ matrix with elements$$[u*v]_{ij}=\sum_{p=0}^{m_1-1}\sum_{q=0}^{m_2-1} u_{i+p,j+q}\,v_{pq}.$$
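A direct, unoptimized implementation of Def 6 (if scipy is handy, `scipy.signal.correlate2d(u, v, mode='valid')` should agree with it):

```python
import numpy as np

def cross_correlate(u, v):
    """Cross-correlation u * v, exactly as in Def 6."""
    n1, n2 = u.shape
    m1, m2 = v.shape
    out = np.zeros((n1 - m1 + 1, n2 - m2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product of v against the window of u anchored at (i, j).
            out[i, j] = np.sum(u[i:i + m1, j:j + m2] * v)
    return out
```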
This differs from a convolution, which is the same but with $v$ reflected on all of its axes first. For some reason machine learning people just call cross-correlations convolutions, so that’s what I’m doing too.
We are extremely close to writing down the energy function for our new network. I keep doing that and then realizing I forgot like 10 things that needed to be said first. I think the last thing that needs saying is this: we’ve shown how to get one row of hidden units, with one filter. In a good network, these units will all be excited by whatever length-$k$ substrings of the input have high correlation with the filter. This will hopefully be some common feature of length-$k$ strings -- for small $k$, this could be something like a phoneme -- I don’t know. Maybe the filter would recognize the short building blocks of words, like “nee” or “bun,” little word-clusters that show up a lot.
But this filter is only going to be sensitive to one feature, and we’d like our model to work with lots of features. So, let’s add more rows of hidden units. Each row will be just like the one we’ve already described, but each row will use its own filter. We’ll stack the rows together, so that the hidden layer is now image-shaped. Let’s let $c$ be the number of filters, and we’ll let $K$ be our filter bank -- this is going to be a 3-d array of shape $c\times k\times N$.
Then, the input to the hidden unit $z^H_{ij}$ is given by the cross-correlation of the $j$th filter with the input. If we use $*$ also to mean cross-correlation with a filter bank, we have:$$x^H_{ij}=[z_V*K]_{ij}:=\sum_{p=0}^{k-1} \sum_{q=0}^{N-1} z^V_{i+p,q} K_{j,p,q}.$$Then the hidden layer $z_H$ will have shape $(n-k+1)\times c$.
This leads us to a natural definition for the energy function. Since cross-correlation is “like an inner product,” we’ll use it to define a quadratic form just like we did in the RBM:\begin{align*}H(z_V,z_H) &=-\langle z_V * K, z_H\rangle\\\\ &:= - \sum_{i=0}^{n-k}\sum_{j=0}^{c-1} x^H_{ij} z^H_{ij}\\\\ &= -\sum_{i=0}^{n-k}\sum_{j=0}^{c-1} z^H_{ij} \sum_{p=0}^{k-1} \sum_{q=0}^{N-1} z^V_{i+p,q} K_{j,p,q}.\end{align*}So, there we have the heart of our new convolutional Boltzmann machine, since the probability is basically determined by the energy.
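In code, staying deliberately close to the formulas (naive loops rather than anything fast):

```python
import numpy as np

def hidden_inputs(z_V, K):
    """x^H_{ij} = sum_{p,q} z^V_{i+p,q} K_{j,p,q}, for a filter bank K."""
    c, k, N = K.shape
    n = z_V.shape[0]
    x_H = np.zeros((n - k + 1, c))
    for j in range(c):
        for i in range(n - k + 1):
            x_H[i, j] = np.sum(z_V[i:i + k, :] * K[j])
    return x_H

def energy(z_V, z_H, K):
    """H(z_V, z_H) = -<z_V * K, z_H>."""
    return -np.sum(hidden_inputs(z_V, K) * z_H)
```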
Categorical input units
There’s only one change left to get us to something that could be called a language model. Right now, the machine has a little problem, which is that it’s possible for a sample from the machine to have two or more units on in some column of the sample. We didn’t use a two-hot encoding for our strings, though, so that sample can’t be understood as a string. We have to find some way to make sure this never happens.
The answer is pretty simple: we just restrict our state space. So far, we had been looking at $z\in\{0,1\}^{V\cup H}$, the set of all possible binary configurations of all the units. But let’s restrict our probability distribution to the set$$\Omega=\bigg\{z\in\{0,1\}^{V\cup H} : \forall i\,\sum_{j} z^V_{ij}= 1  \bigg\}.$$Then our distribution $p(z)$ is still a Gibbs random field on this state space, but we’re left with a question: how do we decide which of the units in each column should be active? The visible units in a column are no longer conditionally independent, so this manipulation of the state space has changed the graph that our model respects by linking the visible columns into cliques.
So, let’s compute the conditional distribution of the visible units given the hidden units. Since the energy is the same, we still have that $$p(z^V_{ij}=1\mid z_H)=\frac{1}{Z_{ij}} e^{x^V_{ij}},$$ for some normalizing constant $Z_{ij}$. The question is, what is the normalizing constant? Well, it satisfies $$Z_{ij}=Z_{ij}p(z^V_{ij}=1\mid z_H) + Z_{ij} p(z^V_{ij}=0\mid z_H),$$and we know that since exactly one unit in the column is on,$$Z_{ij}p(z^V_{ij}=0\mid z_H)=\sum_{k\neq j} Z_{ij} p(z^V_{ik}=1\mid z_H).$$ Plugging back into the previous line, we get$$Z_{ij}=\sum_{k}Z_{ij} p(z^V_{ik}=1\mid z_H).$$ But $Z_{ij}$ obeys this relationship for any choice of $j$, so we see that in fact $$Z_{i0}=Z_{i1}=\dotsm=Z_i$$is constant over the whole column, and that its value is $$Z_i=\sum_{k} Z_i p(z^V_{ik}=1\mid z_H) = \sum_k e^{x^V_{ik}}.$$So, now we have found our conditional distribution $$p(z^V_{ij}=1\mid z_H)=\frac{e^{x^V_{ij}}}{\sum_k e^{x^V_{ik}}}.$$ But this is exactly the softmax footnote:softmax function that often appears when sigmoid-Bernoulli random variables are generalized to categorical random variables! That’s nice and encouraging, and it gives us a simple sampling recipe. Just compute the softmax of $x^V$ over columns, and then run a categorical sampler over each column of the result.
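The recipe is short enough to write out. Here `x_V` is the $n\times N$ array of visible inputs, so each array row (one column of the image, i.e. one string position) gets its own categorical draw -- a sketch, not a tuned sampler:

```python
import numpy as np

def sample_visible_columns(x_V, rng):
    """Sample z_V given z_H: softmax over the alphabet at each position."""
    # Numerically stable softmax along the alphabet dimension.
    logits = x_V - x_V.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    z_V = np.zeros_like(probs)
    for i, p in enumerate(probs):
        z_V[i, rng.choice(len(p), p=p)] = 1.0   # one draw per position
    return z_V
```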
It might be worth noting that this trick of changing the distribution of the input units is a common one: often for real-valued data, people will use Gaussian visible units and leave the rest of the network Bernoulli footnote:GaussBerRBM. Also, I should note that I was inspired to do this by the similar state-space restriction that Lee et al. used to construct their probabilistic max pooling layer footnote:honglak -- this is a pretty direct adaptation of that trick.
Lastly, it might be worth noting that you could do the same thing to the hidden layer that we have just done to the visible layer -- this could be helpful, since it would keep the network activity sparse. Also, in the language domain, maybe it’s appropriate to think of hidden features as mutually exclusive in some way (you can’t have two phonemes at once). I would bet that this assumption starts making less sense as you go deeper in the net. The problem with this idea is that unlike what we’ve just done in the visible layer, changing the state space of the hidden units ends up changing the integral that computes the free energy of the visible units, which means we would have to do that again. So for now, let’s not do this.
Training the language model
Now, we’d like to get this thing learning. But for once, we are in for a lucky break. Notice that this categorical input units business hasn’t changed anything about the free energy -- that integral is still the same. So, since the learning rule is basically determined by the free energy, that means that this categorical input convolutional RBM has the same exact learning rule as a regular old convolutional RBM.
And, we’re in for another break too. Cross-correlations are linear maps, so a convolutional RBM is just a special case of the generic RBM whose learning rule we’ve computed above. In particular, there exists a matrix $M_K$ so that if $z_V’$ is $z_V$ flattened from an image to a flat vector, $M_K z_V’=(z_V*K)’$. The thing is that $M_K$ is very sparse, and its nonzero entries are tied together by weight sharing, so what we really need is the update rule for the kernel entries $K_{ors}$ themselves.
Well, first of all, our free energy hasn’t changed, but we can write it in a more convolutional style: $$F_V(z_V)=-\sum_{i=0}^{n-k}\sum_{j=0}^{c-1} \log(1 + e^{x^H_{ij}}).$$ So, all we need to do is compute $\partial_{K_{ors}}$ of this expression ($K$ is hidden in $x^H_{ij}$). This is quick, so I’ll write it out:\begin{align*}\partial_{K_{ors}} F_V(z_V) &=-\partial_{K_{ors}} \sum_{i=0}^{n-k}\sum_{j=0}^{c-1}\log(1+e^{x^H_{ij}})\\\\ &=-\sum_{i=0}^{n-k} \partial_{K_{ors}} \log(1+e^{x^H_{io}})\\\\ &=-\sum_{i=0}^{n-k}\frac{1}{1+e^{x^H_{io}}} \partial_{K_{ors}} e^{x^H_{io}}\\\\ &=-\sum_{i=0}^{n-k}\sigma(x^H_{io}) \partial_{K_{ors}} x^H_{io}\\\\ &=-\sum_{i=0}^{n-k}\sigma(x^H_{io}) z^V_{i+r,s},\end{align*}where the last line follows directly from the definition of $x^H_{io}$ given above.
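In code, reusing `hidden_inputs` from the sketch above (the triple loop is deliberately naive, so that it matches the formula index-for-index):

```python
import numpy as np

def free_energy_grad_K(z_V, K):
    """dF_V/dK: grad[o,r,s] = -sum_i sigmoid(x^H_{io}) z^V_{i+r,s}."""
    c, k, N = K.shape
    n = z_V.shape[0]
    x_H = hidden_inputs(z_V, K)          # from the earlier sketch
    sig = 1.0 / (1.0 + np.exp(-x_H))     # sigmoid(x^H), shape (n-k+1, c)
    grad = np.zeros_like(K, dtype=float)
    for o in range(c):
        for r in range(k):
            for s in range(N):
                grad[o, r, s] = -np.sum(sig[:, o] * z_V[r:r + n - k + 1, s])
    return grad
```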
That gradient is sort of the convolutional version of the outer product that appeared in the RBM, and it’s a special case of the gradient needed to train a convolutional DBN like footnote:honglak, which allows for a kernel that is not full-height like ours.

But on the other hand, what if we did want to have categorical columns in the hidden layer? It’s plausible to me that only one “language feature” might be active at a given time, and this has the added bonus of keeping the hidden layer activities sparse. This amounts to imposing on the hidden layer the same restriction that we earlier imposed on the state space for the visible layer, i.e. we choose the new state space$$\Omega’=\bigg\{z\in\{0,1\}^{V\cup H} : \forall i\,\sum_{j} z^V_{ij}= 1 = \sum_j z^H_{ij}  \bigg\}.$$Then, of course, we sample the hidden units in the same way that we just discussed sampling categorical visible units.

But now we’ve changed the state space of $z_H$, which in turn changes the behavior of the $\int \, dz_H$ that appears in the definition of free energy. So if we want to train a net like this, we’ll need to re-compute the free energy and see what happens to the learning rule.
Many thanks especially to Matt Ricci for working through all of this material with me, and to Professors Matt Harrison, Nicolas Garcia Trillos, Govind Menon, and Kavita Ramanan for teaching me the math.
filiplig · 7 years
Text
Ellenberg, Jordan - How Not to Be Wrong
page 6 | location 81-86 | Added on Sunday, 1 February 2015 17:10:59
“Mathematics is pretty much the same. You may not be aiming for a mathematically oriented career. That’s fine—most people aren’t. But you can still do math. You probably already are doing math, even if you don’t call it that. Math is woven into the way we reason. And math makes you better at things. Knowing mathematics is like wearing a pair of X-ray specs that reveal hidden structures underneath the messy and chaotic surface of the world. Math is a science of not being wrong about things, its techniques and habits hammered out by centuries of hard work and argument. With the tools of mathematics in hand, you can understand the world in a deeper, sounder, and more meaningful way. All you need is a coach, or even just a book, to teach you the rules and some basic tactics. I will be your coach. I will show you how.”
 page 15 | location 224-226 | Added on Sunday, 1 February 2015 17:28:05
We tend to teach mathematics as a long list of rules. You learn them in order and you have to obey them, because if you don’t obey them you get a C-. This is not mathematics. Mathematics is the study of things that come out a certain way because there is no other way they could possibly be.
 page 16 | location 235-237 | Added on Sunday, 1 February 2015 17:30:10
prosthesis that you attach to your common sense, vastly multiplying its reach and strength. Despite the power of mathematics, and despite its sometimes forbidding notation and abstraction, the actual mental work involved is little different from the way we think about more down-to-earth problems.
 page 42 | location 635-642 | Added on Tuesday, 3 February 2015 17:20:58
What’s the numerical value of an infinite sum? It doesn’t have one—until we give it one. That was the great innovation of Augustin-Louis Cauchy, who introduced the notion of limit into calculus in the 1820s.* The British number theorist G. H. Hardy, in his 1949 book Divergent Series, explains it best: It does not occur to a modern mathematician that a collection of mathematical symbols should have a “meaning” until one has been assigned to it by definition. It was not a triviality even to the greatest mathematicians of the eighteenth century. They had not the habit of definition: it was not natural to them to say, in so many words, “by X we mean Y.” . . . It is broadly true to say that mathematicians before Cauchy asked not, “How shall we define 1 − 1 + 1 − 1 + . . .” but “What is 1 − 1 + 1 − 1 + . . . ?” and that this habit of mind led them into unnecessary perplexities and controversies which were often really verbal.
 page 52 | location 790-794 | Added on Tuesday, 3 February 2015 17:42:46
Dissatisfying as it may be to partisans, I think we have to teach a mathematics that values precise answers but also intelligent approximation, that demands the ability to deploy existing algorithms fluently but also the horse sense to work things out on the fly, that mixes rigidity with a sense of play. If we don’t, we’re not really teaching mathematics at all. It’s a tall order—but it’s what the best math teachers are doing, anyway, while the math wars rage among the administrators overhead.
 page 66 | location 1004-1006 | Added on Wednesday, 4 February 2015 17:27:34
50%. That’s how the Law of Large Numbers works: not by balancing out what’s already happened, but by diluting what’s already happened with new data, until the past is so proportionally negligible that it can safely be forgotten.
 page 91 | location 1390-1391 | Added on Sunday, 8 February 2015 16:43:33
The point of Bennett’s paper is to warn that the standard methods of assessing results, the way we draw our thresholds between a real phenomenon and random static, come under dangerous pressure in this era of massive data sets, effortlessly obtained.
 page 109 | location 1666-1669 | Added on Sunday, 8 February 2015 17:18:38
If only we could go back in time to the dawn of statistical nomenclature and declare that a result passing Fisher’s test with a p-value of less than 0.05 was “statistically noticeable” or “statistically detectable” instead of “statistically significant”! That would be truer to the meaning of the method, which merely counsels us about the existence of an effect but is silent about its size or importance. But it’s too late for that.
 page 145 | location 2217-2223 | Added on Tuesday, 10 February 2015 17:38:22
For Neyman and Pearson, the purpose of statistics isn’t to tell us what to believe, but to tell us what to do. Statistics is about making decisions, not answering questions. A significance test is no more or less than a rule, which tells the people in charge whether to approve a drug, undertake a proposed economic reform, or tart up a website. It sounds crazy at first to deny that the goal of science is to find out what’s true, but the Neyman-Pearson philosophy is not so far from reasoning we use in other spheres. What’s the purpose of a criminal trial? We might naively say it’s to find out whether the defendant actually committed the crime they’re on trial for. But that’s obviously wrong. There are rules of evidence, which forbid the jury from hearing testimony obtained improperly, even if it might help them accurately determine the defendant’s innocence or guilt. The purpose of a court is not truth, but justice.
 page 165 | location 2528-2536 | Added on Wednesday, 11 February 2015 17:25:27
In the Bayesian framework, how much you believe something after you see the evidence depends not just on what the evidence shows, but on how much you believed it to begin with. That may seem troubling. Isn’t science supposed to be objective? You’d like to say that your beliefs are based on evidence alone, not on some prior preconceptions you walked in the door with. But let’s face it—no one actually forms their beliefs this way. If an experiment provided statistically significant evidence that a new tweak of an existing drug slowed the growth of certain kinds of cancer, you’d probably be pretty confident the new drug was actually effective. But if you got the exact same results by putting patients inside a plastic replica of Stonehenge, would you grudgingly accept that the ancient formations were actually focusing vibrational earth energy on the body and stunning the tumors? You would not, because that’s nutty. You’d think Stonehenge probably got lucky. You have different priors about those two theories, and as a result you interpret the evidence differently, despite it being numerically the same.
 page 294 | location 4500-4503 | Added on Tuesday, 17 February 2015 17:37:56
In math there are many, many complicated objects, but only a few simple ones. So if you have a problem whose solution admits a simple mathematical description, there are only a few possibilities for the solution. The simplest mathematical entities are thus ubiquitous, forced into multiple duty as solutions to all kinds of scientific problems.
 page 308 | location 4713-4715 | Added on Wednesday, 18 February 2015 17:34:55
Darwin showed that one could meaningfully talk about progress without any need to invoke purpose. Galton showed that one could meaningfully talk about association without any need to invoke underlying
 page 315 | location 4824-4827 | Added on Wednesday, 18 February 2015 17:47:04
“Lives in the same city as” is transitive, too—if I live in the same city as Bill, who lives in the same city as Bob, then I live in the same city as Bob. Correlation is not transitive. It’s more like “blood relation”—I’m related to my son, who’s related to my wife, but my wife and I aren’t blood relatives to each other. In fact, it’s not a terrible idea to think of correlated variables as “sharing part of their DNA.”
 page 319 | location 4883-4885 | Added on Wednesday, 18 February 2015 17:53:10
Keep this in mind when you’re told that two phenomena in nature or society were found to be uncorrelated. It doesn’t mean there’s no relationship, only that there’s no relationship of the sort that correlation is designed to detect.
 page 330 | location 5050-5056 | Added on Wednesday, 18 February 2015 18:12:40
So we don’t and can’t know the exact expected value of launching a campaign against eggplant or vibrating toothbrushes, or tobacco. But often we can say with confidence that the expected value is positive. Again, that doesn’t mean the campaign is sure to have good effects, only that the sum total of all similar campaigns, over time, is likely to do more good than harm. The very nature of uncertainty is that we don’t know which of our choices will help, like attacking tobacco, and which will hurt, like recommending hormone replacement therapy. But one thing’s for certain: refraining from making recommendations at all, on the grounds that they might be wrong, is a losing strategy. It’s a lot like George Stigler’s advice about missing planes. If you never give advice until you’re sure it’s right, you’re not giving enough advice.
 page 340 | location 5199-5201 | Added on Thursday, 19 February 2015 18:02:02
Public opinion doesn’t exist. More precisely, it exists sometimes, concerning matters about which there’s a clear majority view. Safe to say it’s the public’s opinion that terrorism is bad and The Big Bang Theory is a great show. But cutting the deficit is a different story. The majority preferences don’t meld into a definitive stance.
 page 376 | location 5754-5757 | Added on Sunday, 22 February 2015 13:25:05
I once met a historian of German culture in Columbus, Ohio, who told me that Hilbert’s predilection for wearing sandals with socks is the reason that fashion choice is still noticeably popular among mathematicians today. I could find no evidence this was actually true, but it suits me to believe it, and it gives a correct impression of the length of Hilbert’s shadow.
 page 386 | location 5906-5912 | Added on Sunday, 22 February 2015 13:47:40
But most of the mathematicians I work with now weren’t ace mathletes at thirteen; they developed their abilities and talents on a different timescale. Should they have given up in middle school? What you learn after a long time in math—and I think the lesson applies much more broadly—is that there’s always somebody ahead of you, whether they’re right there in class with you or not. People just starting out look to people with good theorems, people with some good theorems look to people with lots of good theorems, people with lots of good theorems look to people with Fields Medals, people with Fields Medals look to the “inner circle” Medalists, and those people can always look toward the dead. Nobody ever looks in the mirror and says, “Let’s face it, I’m smarter than Gauss.” And yet, in the last hundred years, the joined effort of all these dummies-compared-to-Gauss has produced the greatest flowering of mathematical knowledge the world has ever seen.
 page 409 | location 6267-6269 | Added on Sunday, 22 February 2015 14:21:58
flows with vastly augmented force. The lessons of mathematics are simple ones and there are no numbers in them: that there is structure in the world; that we can hope to understand some of it and not just gape at what our senses present to us; that our intuition is stronger with a formal exoskeleton than without one.
 page 409 | location 6271-6277 | Added on Sunday, 22 February 2015 14:23:10
Every time you observe that more of a good thing is not always better; or you remember that improbable things happen a lot, given enough chances, and resist the lure of the Baltimore stockbroker; or you make a decision based not just on the most likely future, but on the cloud of all possible futures, with attention to which ones are likely and which ones are not; or you let go of the idea that the beliefs of groups should be subject to the same rules as beliefs of individuals; or, simply, you find that cognitive sweet spot where you can let your intuition run wild on the network of tracks formal reasoning makes for it; without writing down an equation or drawing a graph, you are doing mathematics, the extension of common sense by other means. When are you going to use it? You’ve been using mathematics since you were born and you’ll probably never stop. Use it well.
ferroplusferro · 7 years
Video
FYeye Friday: Happy Birthday #?
Pound? Number? Proofreader’s space? Phone option? Hash-tag? A tangled journey to become one of the most used typographic symbols of our era! No, as much as I wanted it to have begun as a skewed tic-tac-toe board or some nine-based system of measurement, most believe that the octothorpe or octothorp or therp or tharp (a word with its own conflated, if recent, past) origin story likely involves, much like other origins, a simplified and bastardized something so someone might more swiftly explain some sum to somebody else.

Those ancient Romans measured things in libra pondos (weight in pounds). That scribble seems to have been scrawled into history in the 1300s and later supplemented with a horizontal scratch by the likes of Isaac Newton. (Again with the English math men!) Or maybe that sign actually turned into today’s British pound symbol [£], which is an altogether different doodad. Opinions differ.

Assuming all that led to its use as the representation of weight in pounds rather than monetary value, we can live with it following a number. 2# = two pounds. OK then. BUT, then can we blame the British for its use as shorthand for the word “number?” Yes… and no. Typewriters in the UK had a “£” in the same spot where US Smith Coronas and such had the “#.” No, the Brits rarely use #, preferring the written “number” or “No.” There seems to be no consensus on how it was repurposed to redundantly mean “number” before a number. Foreshadowing social media stuff perhaps?

Well… A theory that it was a representation of a village with eight surrounding fields would be fine — if it weren’t slanted! It’s even referred to as a hex in Southeast Asia for reasons that defy my understanding of hex as six. And don’t even think about the musical notation for sharp! Another source says the “hatch” of cross-hatched engravings is what led to the name if not the symbol itself. Balderdash! Hash is a mash-up of little bits of potatoes and corned beef mixed and grilled to crispy deliciousness—or any other mixture of miscellaneous stuff never organized in any manner—much less an orderly matrix.

Wait a minute… what about the proofreader’s mark for a space? What grammarian chose to confuse typographers and designers with a mini-grid that meant number or pounds, but neither in this context? Nope. Insert a space. Plus, these sadistic folks also use it to mark the end of something or a bad sentence or a “word boundary.” What?

Did Bell Labs engineers pick up that symbol as a humorous pun on the above for those first touch tone phones or for some use in UNIX? They actually lay claim to that word octo (eight)-thorpe (Thorpe—[allegedly Jim Thorpe fans]). Maybe it just looked like the layout of a keypad? We can’t seem to hash that out! Whatever you want to believe, it’s likely tech types liked the skewed grid and it became a multi-purpose ASCII character used to delineate bits of coding in pretty much every computer language thereafter. HTML code uses #s for lots of things. Colors for example: <bgcolor=#123456> Hold on— those are called... HEX CODES. Mind. Blown.

The logical leap says that our current social media saturation with said symbol — kick-started by Twitter 10 years ago this week with #barcamp — is simply an extension of coding standards that gave us and our computing machines a sign of a subcategory or some such. That gives you something to think about, but what I really want to know is why it’s the only italic character in an otherwise Roman (straight up) set of letters. 
Visual tension can be a good thing, but I so want to straighten that thing up. Its slant provides a sense of movement which — combined with our current online-all-the-time, tag-it-all reality — infuses hash-tagged visuals with energy, relevance, and plugged-in immediacy. To rehash, #octothorpe #hex #number #pound #wordspace #hashtag #whatever, splash a dash of that # about on your journey to energize your #coms or #logos or #socialmediagraphics before its panache passes by. F+FYI—Ferro+Ferro is always here to lead your #visualcommunications — #branding, #screen, #print, and more in a #positivedirection. ### From whence I plagiarized, I mean adapted, this info… https://en.wikipedia.org https://books.google.com/books?id=3fbWAAAAQBAJ&lpg=PA56&dq=%22octothorpe%22&pg=PA41#v=onepage&q=%22octothorpe%22&f=false https://www.quora.com/Why-has-the-symbol-been-called-a-number-sign-pound-sign-and-now-hashtag http://www.newstatesman.com/sci-tech/2014/06/history-journey-and-many-faces-hash-symbol
careergrowthblog · 5 years
Text
Your curriculum defines your school. Own it. Shape it. Celebrate it.
At the Heads’ Roundtable event this week I was making a pitch for school leaders to get stuck into a deep curriculum review process – as many already have.  Not because of the expectations of whatever accountability process is underway, but because it matters so much.   To a degree that is underplayed all too often, I would suggest that schools are fundamentally defined by their curriculum.
Every school has its motto – those value statements emblazoned on every letterhead, every blazer, above the entrance… Respect. Courage. Resilience.  Ambition.  Compassion.  Fortiter Ex Animo.  Carpe Diem.  But these grand ideas only take form in the context of students doing things, learning things, experiencing things, receiving messages about things.. actual things that you have decided on.   And those things are your curriculum; the actual tangible real-life curriculum that is enacted across the days, weeks and years of a life in your school.
As I have explored in some detail in this post, 10 Steps for Reviewing Your KS3 Curriculum, there is a process that applies to any curriculum review, starting with getting to know your school curriculum as it is.   I’m suggesting that school leaders make sure they have developed a strong set of principles around curriculum design and content informed by exploring their own curriculum and a range of alternatives.  What do you believe about what your children should learn across the curriculum?  Often leaders are only relatively expert in a narrow set of curriculum ideas – we’ve all largely been trained in specific knowledge domains so it’s difficult to know what the possibilities are; to know what an excellent curriculum might look like in every area. But, whilst it may always be necessary to defer to the expertise of others – inside and outside your school –  it pays to get into the detail, to begin to  learn about each area and develop some reference points.
Here are some questions you might want to ask yourself about your school curriculum:
Maths: How is the maths curriculum organised and sequenced? Is it logical, appropriately designed for mastery, with vertical coherence?  Where does it start? Number? Sequences? Why does it start there? Who decided? Do they know why?
English:  Which books will students read in each year group? If you don’t know, this is a good place to start.  What’s the rationale for each one and the overall sequence?  Which books have been left out.. who decided?  Do teachers choose or is it a departmental approach? Overall, what’s the range of genres, the balance of ‘classics’ vs contemporary fiction?  Is the selection something you feel is bold, interesting, building a secure foundation in the literary canon for future reading, opening doors to the world of literature, challenging and demanding as well as inspiring and engaging?
History:  Does your history curriculum provide your students with the knowledge and understanding you’d hope for given who they are and where you live – including those who select it for GCSE and those that don’t?  What does your curriculum say about your priorities – is it balanced well between UK history and world history, a range of historical periods, a range of types of events – power and politics, social history, wars?  Does it allow for alternative perspectives and some depth studies alongside a broad factual overview of events and key figures – the facts that every child should know?
The same questions for English and History apply to art and music and RS.  Which artists do they meet? When do students learn about Islam? What do they learn about Islam?  Which style of music curriculum do we offer: composition-focused with a contemporary slant or more classical with a strong strand of theory, notation and music history?
In Science- are you confident that the curriculum is designed bottom-up with key concepts – particles, energy, cells – embedded and built-upon.  What’s the general experience in relation to practical work and hands-on learning? Will students grow those plants or just label diagrams of them in theory? Will they ever design an experiment of their own?  Will the Geography and Science curriculum links be strong so that your students definitely all gain a very strong foundation of knowledge about climate change and sustainability? What exactly will they learn? And when?
Across the curriculum, where do students get to develop their oracy skills, to extend their knowledge of their local community, to mentally travel the world, to physically get into the countryside or the city, to see museums, to gain cultural experiences, to hear the Holocaust story, to discuss homophobia and racism, to learn about sex and relationships, to make things and be creative, to engage in an extended learning project of some form, to make a choice about the way they communicate their ideas?
Across the curriculum are you satisfied that it is challenging enough?  Challenge is a vague notion until you tie it down to some specific curriculum choices.  What’s the diet like in Year 5 and Year 8?  Are your highest attainers truly stretched? Is there any padding, filler, soft, weak meandering when a bit more rigour might be more appropriate?  Where you’ve had to make compromises because of the constraints of time and resources – are you happy you’ve arrived at the best possible balance between competing choices?
All of these questions can be answered.  And then you have to decide what you think. Is it a good curriculum? Are there different, better choices you could make? Is it a curriculum you feel proud of – that represents the school you want to run and be part of? Because this is what your school actually is.  Your curriculum is your school. So, to the greatest extent possible, it pays to own it, to shape it and to celebrate it.  It’s so powerful if you can be on the front foot before anyone else comes along to test it out.  Here’s our curriculum: This is what we are. This is what we do.  And we’re proud of it.
Your curriculum defines your school. Own it. Shape it. Celebrate it.
0 notes
careergrowthblog · 5 years
Text
Your curriculum defines your school. Own it. Shape it. Celebrate it.
At the Heads’ Roundtable event this week I was making a pitch for school leaders to get stuck into a deep curriculum review process – as many already have.  Not because of the expectations of whatever accountability process is underway, but because it matters so much.   To a degree that is underplayed all too often, I would suggest that schools are fundamentally defined by their curriculum.
Every school has its motto – those value statements emblazoned on every letterhead, every blazer, above the entrance… Respect. Courage. Resilience.  Ambition.  Compassion.  Fortiter Ex Animo.  Carpe Diem.  But these grand ideas only take form in the context of students doing things, learning things, experiencing things, receiving messages about things.. actual things that you have decided on.   And those things are your curriculum; the actual tangible real-life curriculum that is enacted across the days, weeks and years of a life in your school.
As I have explored in some detail in this post, 10 Steps for Reviewing Your KS3 Curriculum, there is a process that applies to any curriculum review, starting with getting to know your school curriculum as it is.   I’m suggesting that school leaders make sure they have developed a strong set of principles around curriculum design and content informed by exploring their own curriculum and a range of alternatives.  What do you believe about what your children should learn across the curriculum?  Often leaders are only relatively expert in a narrow set of curriculum ideas – we’ve all largely been trained in specific knowledge domains so it’s difficult to know what the possibilities are; to know what an excellent curriculum might look like in every area. But, whilst it may always be necessary to defer to the expertise of others – inside and outside your school –  it pays to get into the detail, to begin to  learn about each area and develop some reference points.
Here are some questions you might want to ask yourself about your school curriculum:
Maths: How is the maths curriculum organised and sequenced? Is it logical, appropriately designed for mastery, vertical coherence.  Where does it start? Number? Sequences? Why does it start there? Who decided? Do they know why?
English:  Which books will students read in each year group? If you don’t know, this is a good place to start.  What’s the rationale for each one and the overall sequence?  Which books have been left out.. who decided?  Do teachers choose or is it a departmental approach? Overall, what’s the range of genres, the balance of ‘classics’ vs contemporary fiction?  Is the selection something you feel is bold, interesting, building a secure foundation in the literary canon for future reading, opening doors to the world of literature, challenging and demanding as well as inspiring and engaging?
History:  Does your history curriculum provide your students with the knowledge and understanding you’d hope for given who they are and where you live-  including those who select if for GCSE and those that don’t?  What does your curriculum say about your priorities – it is balanced well between   UK history and world history, a range of historical periods, a range of types of events – power and politics, social history, wars?  Does it allow for alternative perspectives and some depth studies alongside a broad factual overview of events and key figures – the facts that every child should know?
The same questions for English and History apply to art and music and RS.  Which artists do they meet? When do students learn about Islam? What do they learn about Islam?  Which style of music curriculum do we offer: composition-focused with a contemporary slant or more classical with a strong strand of theory, notation and music history?
In Science- are you confident that the curriculum is designed bottom-up with key concepts – particles, energy, cells – embedded and built-upon.  What’s the general experience in relation to practical work and hands-on learning? Will students grow those plants or just label diagrams of them in theory? Will they ever design an experiment of their own?  Will the Geography and Science curriculum links be strong so that your students definitely all gain a very strong foundation of knowledge about climate change and sustainability? What exactly will they learn? And when?
Across the curriculum, where do students get to develop their oracy skills, to extend their knowledge of their local community, to mentally travel the world, to physically get into the countryside or the city, to see museums, to gain cultural experiences, to hear the Holocaust story, to discuss homophobia and racism, to learn about sex and relationships, to make things and be creative, to engage in an extended learning project of some form, to make a choice about the way they communicate their ideas?
Across the curriculum are you satisfied that it is challenging enough?  Challenge is a vague notion until you tie it down to some specific curriculum choices.  What’s the diet like in Year 5 and Year 8?  Are your highest attainers truly stretched? Is there any padding, filler, soft, weak meandering when a bit more rigour might be more appropriate?  Where you’ve had to make compromises because of the constraints of time and resources – are you happy you’ve arrived at the best possible balance between competing choices?
All of these questions can be answered.  And then you have to decide what you think. Is it a good curriculum? Are there different, better choices you could make? Is it a curriculum you feel proud of – that represents the school you want to run and be part of?       Because this is what your school actually is.  Your curriculum is your school. So, to the greatest extent possible, it pays to own it, to shape it and to celebrate it.  It’s so powerful if you can be on the front foot before anyone else comes along to test it out.  Here’s our curriculum: This is what we are. This is what we do.  And we’re proud of it.
Your curriculum defines your school. Own it. Shape it. Celebrate it. published first on https://medium.com/@KDUUniversityCollege
0 notes
careergrowthblog · 6 years
Text
Your curriculum defines your school. Own it. Shape it. Celebrate it.
At the Heads’ Roundtable event this week I was making a pitch for school leaders to get stuck into a deep curriculum review process – as many already have.  Not because of the expectations of whatever accountability process is underway, but because it matters so much.   To a degree that is underplayed all too often, I would suggest that schools are fundamentally defined by their curriculum.
Every school has it’s motto – those value statements emblazoned on every letterhead, every blazer, above the entrance… Respect. Courage. Resilience.  Ambition.  Compassion.  Fortiter Ex Animo.  Carpe Diem.  But these grand ideas only take form in the context of students doing things, learning things, experiencing things, receiving messages about things.. actual things that you have decided on.   And those things are your curriculum; the actual tangible real-life curriculum that is enacted across the days, weeks and years of a life in your school.
As I have explored in some detail in this post, 10 Steps for Reviewing Your KS3 Curriculum, there is a process that applies to any curriculum review, starting with getting to know your school curriculum as it is.   I’m suggesting that school leaders make sure they have developed a strong set of principles around curriculum design and content informed by exploring their own curriculum and a range of alternatives.  What do you believe about what your children should learn across the curriculum?  Often leaders are only relatively expert in a narrow set of curriculum ideas – we’ve all largely been trained in specific knowledge domains so it’s difficult to know what the possibilities are; to know what an excellent curriculum might look like in every area. But, whilst it may always be necessary to defer to the expertise of others – inside and outside your school –  it pays to get into the detail, to begin to  learn about each area and develop some reference points.
Here are some questions you might want to ask yourself about your school curriculum:
Maths: How is the maths curriculum organised and sequenced? Is it logical and appropriately designed for mastery, with vertical coherence? Where does it start? Number? Sequences? Why does it start there? Who decided? Do they know why?
English: Which books will students read in each year group? If you don’t know, this is a good place to start. What’s the rationale for each one and the overall sequence? Which books have been left out… who decided? Do teachers choose or is it a departmental approach? Overall, what’s the range of genres, the balance of ‘classics’ vs contemporary fiction? Is the selection something you feel is bold, interesting, building a secure foundation in the literary canon for future reading, opening doors to the world of literature, challenging and demanding as well as inspiring and engaging?
History: Does your history curriculum provide your students with the knowledge and understanding you’d hope for given who they are and where you live – including those who select it for GCSE and those that don’t? What does your curriculum say about your priorities – is it balanced well between UK history and world history, a range of historical periods, a range of types of events – power and politics, social history, wars? Does it allow for alternative perspectives and some depth studies alongside a broad factual overview of events and key figures – the facts that every child should know?
The same questions asked of English and History apply to art, music and RS. Which artists do students meet? When do they learn about Islam, and what do they learn about it? Which style of music curriculum do we offer: composition-focused with a contemporary slant, or more classical with a strong strand of theory, notation and music history?
In Science – are you confident that the curriculum is designed bottom-up, with key concepts – particles, energy, cells – embedded and built upon? What’s the general experience in relation to practical work and hands-on learning? Will students grow those plants or just label diagrams of them in theory? Will they ever design an experiment of their own? Will the Geography and Science curriculum links be strong enough that all your students gain a secure foundation of knowledge about climate change and sustainability? What exactly will they learn? And when?
Across the curriculum, where do students get to develop their oracy skills, to extend their knowledge of their local community, to mentally travel the world, to physically get into the countryside or the city, to see museums, to gain cultural experiences, to hear the Holocaust story, to discuss homophobia and racism, to learn about sex and relationships, to make things and be creative, to engage in an extended learning project of some form, to make a choice about the way they communicate their ideas?
Across the curriculum, are you satisfied that it is challenging enough? Challenge is a vague notion until you tie it down to specific curriculum choices. What’s the diet like in Year 5 and Year 8? Are your highest attainers truly stretched? Is there any padding or filler – soft, weak meandering – where a bit more rigour might be more appropriate? Where you’ve had to make compromises because of the constraints of time and resources, are you happy you’ve arrived at the best possible balance between competing choices?
All of these questions can be answered. And then you have to decide what you think. Is it a good curriculum? Are there different, better choices you could make? Is it a curriculum you feel proud of – that represents the school you want to run and be part of? Because this is what your school actually is. Your curriculum is your school. So, to the greatest extent possible, it pays to own it, to shape it and to celebrate it. It’s so powerful if you can be on the front foot before anyone else comes along to test it out. Here’s our curriculum: This is what we are. This is what we do. And we’re proud of it.
Your curriculum defines your school. Own it. Shape it. Celebrate it.
published first on https://medium.com/@KDUUniversityCollege
0 notes