#(normalising data is a data thing separate meaning from the general/social(?) use of the word
dkettchen · 5 months ago
[Image]
absolutely unintelligible meme I made during bootcamp lecture this morning
theresawelchy · 6 years ago
Beyond news contents: the role of social context for fake news detection
Shu et al., WSDM’19
Today we’re looking at a more general fake news problem: detecting fake news that is being spread on a social network. Forgetting the computer science angle for a minute, it seems intuitive to me that some important factors here might be:
what is being said (the content of the news), and perhaps how it is being said (although fake news can be deliberately written to mislead users by mimicking true news)
where it was published (the credibility / authority of the source publication). For example, something in the Financial Times is more likely to be true than something in The Onion!
who is spreading the news (the credibility of the user accounts retweeting it for example – are they bots??)
Therefore I’m a little surprised to read in the introduction that:
The majority of existing detection algorithms focus on finding clues from the news content, which are generally not effective because fake news is often intentionally written to mislead users by mimicking true news.
(The related work section does, however, discuss several works that include social context.)
So instead of just looking at the content, we should also look at the social context: the publishers and the users spreading the information! The fake news detection system developed in this paper, TriFN, considers tri-relationships between news pieces, publishers, and social network users.
… we are to our best knowledge the first to classify fake news by learning the effective news features through the tri-relationship embedding among publishers, news contents, and social engagements.
And guess what, considering publishers and users does indeed turn out to improve fake news detection!
Inputs
We have a set of publishers, m social network users, and n news articles. Using a vocabulary of t words, we can compute an n x t bag-of-words feature matrix.
For the m users, we can have an m x m adjacency matrix, where entry (i, j) is 1 if users i and j are friends, and 0 otherwise.
We also know which users have shared which news pieces; this is encoded in a user-news interaction matrix.
A publisher-news matrix similarly encodes which publishers have published which news pieces.
For some publishers, we can know their partisan bias. In this work, bias ratings from mediabiasfactcheck.com are used, taking just the ‘Left-Bias’, ‘Least-Bias’ (neutral) and ‘Right-Bias’ values (ignoring the intermediate left-center and right-center values) and encoding these as -1, 0, and 1 respectively in a publisher partisan label vector. Not every publisher will have a bias rating available. We’d like to put ‘-’ in the entry for that publisher in the label vector, but since we can’t do that, a separate indicator vector encodes whether or not we have a bias rating available for publisher p.
There’s one last thing at our disposal: a labelled dataset for news articles telling us whether they are fake or not. (Here we have just the news article content, not the social context).
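To make the inputs concrete, here is a minimal NumPy sketch of how they might be laid out. Everything in it is made up for illustration: the toy articles, the friendships, the variable names, and the label coding (1 for fake, -1 for true) are my assumptions rather than anything taken from the paper.

```python
import numpy as np

# Toy data: every name and value here is made up for illustration.
articles = ["aliens land in paris", "markets rise again", "aliens rise again"]
vocab = sorted({word for text in articles for word in text.split()})
n, t, m = len(articles), len(vocab), 4

# n x t bag-of-words matrix: bow[a, w] counts occurrences of word w in article a.
bow = np.zeros((n, t))
for a, text in enumerate(articles):
    for word in text.split():
        bow[a, vocab.index(word)] += 1

# m x m adjacency matrix: 1 if users i and j are friends, 0 otherwise.
friends = np.zeros((m, m), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    friends[i, j] = friends[j, i] = 1

# m x n user-news matrix: 1 if user i has shared article a.
shares = np.zeros((m, n), dtype=int)
for i, a in [(0, 0), (1, 0), (2, 1), (3, 2)]:
    shares[i, a] = 1

# Publisher-news matrix (one row per publisher): 1 if the publisher published article a.
published = np.array([[1, 0, 1],
                      [0, 1, 0]])

# Partisan labels (-1, 0, 1) plus an availability mask; the second publisher is unrated.
bias_code = {"Left-Bias": -1, "Least-Bias": 0, "Right-Bias": 1}
ratings = ["Left-Bias", None]
partisan = np.array([bias_code.get(r, 0) for r in ratings])  # 0 is only a filler when unrated
has_rating = np.array([0 if r is None else 1 for r in ratings])

# Labels for the labelled subset of articles (assumed coding: 1 = fake, -1 = true).
labelled_idx = np.array([0, 1])
y_labelled = np.array([1, -1])
```

The mask is what lets an unrated publisher be told apart from one genuinely rated as neutral, since both carry a 0 in the label vector.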
The Tri-relationship embedding framework
TriFN takes all of those inputs and combines them with a fake news binary classifier. Given lots of users and lots of news articles, we can expect some of the raw inputs to be pretty big, so the authors make heavy use of dimensionality reduction using non-negative matrix factorisation to learn latent space embeddings (more on that in a minute!). TriFN combines:
A news content embedding
A user embedding
A user-news interaction embedding
A publisher-news interaction embedding, and
The prediction made by a linear classifier trained on the labelled fake news dataset
Pictorially it looks like this (with apologies for the poor resolution, which is an artefact of the original):
[Figure: overview of the TriFN framework]
News content embedding
Let’s take a closer look at non-negative matrix factorisation (NMF) to see how this works to reduce dimensionality. Remember the bag-of-words matrix for news articles? That’s an n x t matrix where n is the number of news articles and t is the number of words in the vocabulary. NMF tries to learn a latent embedding that captures the information in the matrix in a much smaller space. In the general form NMF seeks to factor a (non-negative) matrix M into the product of two (non-negative) matrices W and H (or D and V as used in this paper). How does that help us? We can pick some dimension d (controlling the size of the latent space) and break down the n x t matrix into a d-dimensional representation of news articles, D, and a d-dimensional representation of words in the vocabulary, V. That means that D has shape n x d and V has shape t x d, and so the product of D with the transpose of V ends up with the desired shape n x t. Once we’ve learned a good representation of news articles, D, we can use its rows as the news content embeddings within TriFN.
We’d like to get the product of D and the transpose of V as close to the original bag-of-words matrix as we can, and at the same time keep D and V ‘sensible’ to avoid over-fitting. We can do that with a regularisation term. So the overall optimisation problem looks like this:
[Equation: regularised NMF objective for the news content embedding]
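As a rough illustration of the idea, here is a small NumPy sketch of NMF with an L2 penalty. It assumes the objective is the squared Frobenius distance between the bag-of-words matrix X and D V^T, plus lam times the squared norms of D and V, and it uses the standard multiplicative update rules. The optimiser, the function names, and the value of lam are my choices for the sketch, not necessarily what the authors use.

```python
import numpy as np

def nmf_l2(X, d, lam=0.1, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative n x t matrix X into D (n x d) and V (t x d)."""
    rng = np.random.default_rng(seed)
    n, t = X.shape
    D = rng.random((n, d)) + eps
    V = rng.random((t, d)) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry of D and V non-negative.
        D *= (X @ V) / (D @ (V.T @ V) + lam * D + eps)
        V *= (X.T @ D) / (V @ (D.T @ D) + lam * V + eps)
    return D, V

def objective(X, D, V, lam=0.1):
    """Squared reconstruction error plus the L2 regularisation term."""
    fit = np.linalg.norm(X - D @ V.T) ** 2
    reg = lam * (np.linalg.norm(D) ** 2 + np.linalg.norm(V) ** 2)
    return fit + reg

# Toy "bag-of-words" counts: 6 articles, 10 vocabulary words.
X = np.random.default_rng(1).integers(0, 4, size=(6, 10)).astype(float)
D0, V0 = nmf_l2(X, d=3, iters=0)   # the random starting point
D, V = nmf_l2(X, d=3)              # after 200 update sweeps
```

Each row of D is then a 3-dimensional embedding of one article, and comparing objective(X, D0, V0) with objective(X, D, V) shows how much the updates improved on the random start.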
User embedding
For the user embedding there’s a similar application of NMF, but in this case we’re splitting the m x m adjacency matrix into a user latent matrix (m x d) and a user correlation matrix (d x d). So in this case we’re using NMF to learn a three-way product (the user latent matrix, times the user correlation matrix, times the transpose of the user latent matrix), which has shape mxd . dxd . dxm, resulting in the desired mxm shape. There’s also a user-user relation matrix which controls the contribution of the entries of the adjacency matrix. The basic idea is that any given user will only share a small fraction of news articles, so a positive case (having shared an article) should have more weight than a negative case (not having shared).
[Equation: user embedding objective]
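Here is a small sketch of what a weighted three-factor reconstruction loss of this kind could look like. I am assuming the weighting matrix is applied element-wise to the reconstruction error, and the names A, U, T, and Y, as well as the 1.0 versus 0.1 weights, are placeholders of mine rather than values from the paper.

```python
import numpy as np

def user_embedding_loss(A, U, T, Y, lam=0.0):
    """Weighted squared error between A and U T U^T, plus an L2 penalty."""
    residual = Y * (A - U @ T @ U.T)   # element-wise weighting of each entry's error
    return np.sum(residual ** 2) + lam * (np.sum(U ** 2) + np.sum(T ** 2))

m, d = 5, 2
rng = np.random.default_rng(0)
U = rng.random((m, d))   # user latent matrix, m x d
T = rng.random((d, d))   # user correlation matrix, d x d

# Shape check: (m x d)(d x d)(d x m) gives m x m.
reconstruction = U @ T @ U.T

# A toy adjacency matrix, with positive entries weighted more than zeros.
A = np.zeros((m, m))
A[0, 1] = A[1, 0] = 1.0
Y = np.where(A > 0, 1.0, 0.1)

loss = user_embedding_loss(A, U, T, Y, lam=0.01)
```

Shrinking the weight on the zero entries means the factorisation is penalised much less for predicting a link that is simply unobserved than for missing one that exists.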
User-news interaction embedding
For the user-news interaction embedding we want to capture the relationship between user features and the labels of news items. The intuition is that users with low credibility are more likely to spread fake news. So how do we get user credibility? Following ‘Measuring user credibility in social media’, the authors base this on similarity to other users. First, users are clustered into groups such that members of the same cluster all tend to share the same news stories. Then each cluster is given a credibility score based on its relative size. Users take on the credibility score of the cluster they belong to. It all seems rather vulnerable to the creation of large numbers of fake bot accounts that collaborate to spread fake news, if you ask me. Nevertheless, assuming we have reliable credibility scores, we want to set things up such that the latent features of high-credibility users are close to true news, and the latent features of low-credibility users are close to fake news.
[Equation: user-news interaction objective]
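The cluster-then-score idea can be sketched in a few lines. This is a deliberately crude stand-in: it assumes users belong to the same cluster only when they shared exactly the same set of articles, and that a cluster's score is its size relative to the largest cluster. Both rules, and the function name, are my simplifications and not the method from the credibility paper.

```python
from collections import defaultdict

def credibility_scores(shares):
    """shares maps each user to the set of article ids that user has shared."""
    # Crude clustering: users with identical share sets land in the same cluster.
    clusters = defaultdict(list)
    for user, shared in shares.items():
        clusters[frozenset(shared)].append(user)
    # Score each cluster by its size relative to the largest cluster.
    largest = max(len(members) for members in clusters.values())
    return {user: len(members) / largest
            for members in clusters.values()
            for user in members}

shares = {
    "u1": {"a1", "a2"},
    "u2": {"a1", "a2"},
    "u3": {"a1", "a2"},
    "u4": {"a3"},
}
scores = credibility_scores(shares)
```

Here the three users who share the same stories get a score of 1.0 and the lone user gets one third, which also shows the weakness noted above: a large enough group of coordinated bot accounts would form the biggest cluster.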
Publisher-news embeddings
Recall we have the publisher-news matrix encoding which publishers have published which news pieces. Take the normalised version of the same. We want to find a weighting matrix mapping news publishers’ latent features to the corresponding partisan label vector. It looks like this:
[Equation: publisher-news objective]
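A sketch of how such a loss might be computed is below. It rests on several assumptions of mine: that ‘normalised’ means each publisher's row is divided by its sum, that a publisher's latent features are the normalised matrix times the news embedding D, and that the availability mask zeroes out the error for unrated publishers. The names B, D, q, o, and e are placeholders for the sketch.

```python
import numpy as np

def partisan_loss(B, D, q, o, e, lam=0.0):
    """Masked squared error between predicted and known partisan labels."""
    row_sums = B.sum(axis=1, keepdims=True)
    B_norm = B / np.maximum(row_sums, 1)   # row-normalise, guarding against empty rows
    predicted = B_norm @ D @ q             # publisher latent features mapped to a label
    return np.sum((e * (predicted - o)) ** 2) + lam * np.sum(q ** 2)

B = np.array([[1, 0, 1],
              [0, 1, 0]])          # 2 publishers x 3 articles
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])         # 3 articles x 2 latent dimensions
q = np.array([-1.0, 1.0])          # weights mapping latent features to a label
o = np.array([-1, 0])              # partisan labels; the second entry is a filler
e = np.array([1, 0])               # only the first publisher has a rating

masked = partisan_loss(B, D, q, o, e)
unmasked = partisan_loss(B, D, q, o, np.array([1, 1]))
```

With the mask in place the unrated publisher contributes nothing to the loss; without it, the filler value 0 would be treated as a real ‘neutral’ rating and penalised.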
Semi-supervised linear classifier
Using the labelled data available, we also learn a weighting matrix mapping news latent features to fake news labels.
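As an illustration, here is a sketch that fits such a linear map with closed-form ridge regression on the labelled articles only. That is a swapped-in stand-in of mine: in TriFN the weights are learned as part of the joint objective rather than in a separate step. The label coding (1 for fake, -1 for true), the function name, and the toy embeddings are likewise assumptions.

```python
import numpy as np

def fit_linear_classifier(D_labelled, y_labelled, lam=1e-3):
    """Ridge regression from latent news features to +1 / -1 labels."""
    d = D_labelled.shape[1]
    gram = D_labelled.T @ D_labelled + lam * np.eye(d)
    return np.linalg.solve(gram, D_labelled.T @ y_labelled)

# Toy news embeddings: 5 articles, 2 latent dimensions.
D = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9],
              [0.8, 0.2]])
labelled = np.array([0, 1, 2, 3])               # the fifth article is unlabelled
y_labelled = np.array([1.0, 1.0, -1.0, -1.0])   # assumed coding: 1 = fake, -1 = true

p = fit_linear_classifier(D[labelled], y_labelled)
predictions = np.sign(D @ p)                    # predict for every article
```

The unlabelled fifth article sits close to the two ‘fake’ examples in latent space, so it is predicted as fake too, which is the semi-supervised part: the labels only ever touch the labelled rows, but the embedding carries them over.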
Putting it all together
The overall objective becomes to find all of the matrices introduced above using a weighted combination of each of the above embedding formulae, and a regularisation term combining all of the learned matrices.
It looks like this:
[Equation: overall TriFN objective]
and it’s trained like this:
[Figure: TriFN optimisation algorithm]
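The shape of that combined objective can be sketched as a weighted sum plus a shared penalty. The component names, weights, and loss values below are invented for illustration, and the sketch only evaluates the objective; it does not attempt the paper's training procedure.

```python
import numpy as np

def combined_objective(losses, weights, matrices, lam):
    """Weighted sum of component losses plus an L2 penalty on all learned matrices."""
    weighted = sum(weights[name] * value for name, value in losses.items())
    penalty = lam * sum(np.sum(M ** 2) for M in matrices)
    return weighted + penalty

# Hypothetical component losses and trade-off weights.
losses = {"news_content": 2.0, "user": 4.0}
weights = {"news_content": 1.0, "user": 0.5}
learned = [np.ones((2, 2))]   # stand-in for the learned matrices

total = combined_objective(losses, weights, learned, lam=0.1)
```

In this toy case the total is 2.0 + 0.5 x 4.0 + 0.1 x 4, or about 4.4; the weights are the knobs that decide how much each source of social context is allowed to pull on the shared news embedding.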
Evaluation
TriFN is evaluated against several state-of-the-art fake news detection methods using the FakeNewsNet BuzzFeed and PolitiFact datasets.
It gives the best performance on both of them:
[Figure: performance comparison on the BuzzFeed and PolitiFact datasets]
the morning paper published first on the morning paper