spatialshadow-blog - Tumblr blog

spatialshadow-blog · 5 years ago


Job Application

Analytical Ability

https://spatialshadow.tumblr.com/post/186603275399/analytical-ability

Time Management

https://spatialshadow.tumblr.com/post/186603292784/time-management

Skills

https://spatialshadow.tumblr.com/post/186603311869/skills

Community and professionalism

https://spatialshadow.tumblr.com/post/186603334984/community-and-professionalism

Something Awesome Project

https://spatialshadow.tumblr.com/post/186603352634/something-awesome-project


Something Awesome Project

The aim of the project was to create a spam detector capable of detecting spam emails and moving them to the spam folder of an email account. The program is written to be compatible with Gmail addresses.

The marking criteria for the Something Awesome project can be found here: https://spatialshadow.tumblr.com/post/185620710789/something-awesome-project-proposal

To run the program, run SpamBot.py followed by the email address and then the option you would like to perform.

The commands you can run are as follows.

python3 SpamBot.py <email> run [N] : runs the spam detector over the email account, classifies the emails, and moves those classified as spam into the spam folder. Since this is written for Gmail, what this really means is that it marks the email as spam, which causes it to show up in the spam folder of the Gmail account. If N is specified, it only classifies the first N emails in the inbox.

python3 SpamBot.py <email> test [N] : tests the performance of the spam bot on the given email account. It assumes that the emails are already sorted correctly (spam emails in the spam folder, ham emails in the inbox) and reports the performance and confusion matrix. If N is specified, it only tests on the first N emails in both the spam folder and the inbox.

python3 SpamBot.py <email> train [N] : trains the spam detector on the emails in the email account. It assumes that the emails are already sorted correctly (spam emails in the spam folder, and the inbox contains only ham emails). If N is specified, it only trains on the first N emails in both the spam folder and the inbox.
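The three commands above share one shape (email address, command, optional N). As a rough illustration only, that command line could be parsed with the standard argparse module as below. The function names build_parser and parse_command are hypothetical and are not taken from the actual SpamBot.py.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring: SpamBot.py <email> run|test|train [N]
    parser = argparse.ArgumentParser(prog="SpamBot.py")
    parser.add_argument("email", help="email address to connect to")
    parser.add_argument("command", choices=["run", "test", "train"])
    parser.add_argument("n", nargs="?", type=int, default=None,
                        help="optional limit on the number of emails")
    return parser


def parse_command(argv):
    # Returns (email, command, n); n is None when no limit was given.
    args = build_parser().parse_args(argv)
    if args.n is not None and args.n <= 0:
        raise ValueError("N must be a positive integer")
    return args.email, args.command, args.n
```

For example, parse_command(["someone@gmail.com", "test", "50"]) gives the address, the command "test" and the limit 50, while leaving N off gives None for the limit.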

The following is a link to a video of the project, which shows the program being run. Due to the time restrictions on the video I had to cut it down a little, as the program takes quite a while to run and I wanted to show both the test and run results.

< https://drive.google.com/file/d/1uxyGvEXHn29mlIPKdUz5F0M8Nmm9jdno/view?usp=sharing >

As you can see, the performance of the program is around the 80% mark, and it can classify 3 types of spam emails: Advertising, Promotion and Phishing.

While creating this project I conducted significant research into the different types of spam emails and the methods they employ. I found quite a lot of common information on this, which I mentioned in my blogs. Some of the insights I found include that Advertising and Promotion spam emails tend to be aimed at a large target audience rather than individuals, and as such tend to have only HTML sections. This is mainly because most email clients now support HTML, and the senders don't really care whether their emails reach everyone, only that they reach a majority of people. As such, that can provide a useful indication. I also found that the types of spam emails differ between email accounts, and that an account tends to keep seeing spam similar to what it is already receiving. Additionally, because the detector learns its correlations from the inbox, once trained the detector is essentially linked to that email account.
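To make the "only HTML sections" indicator concrete, here is a minimal sketch using Python's standard email package. It is not the project's actual code: the helper names are hypothetical, and "HTML-only" is assumed here to mean the message has a text/html part and no text/plain part.

```python
from email.message import EmailMessage


def leaf_content_types(msg):
    # Content types of every non-container part of the message.
    return [part.get_content_type()
            for part in msg.walk() if not part.is_multipart()]


def is_html_only(msg):
    # Assumption: "HTML-only" = has an HTML part but no plain-text part.
    types = leaf_content_types(msg)
    return "text/html" in types and "text/plain" not in types


# An advert-style message with just an HTML body.
html_only = EmailMessage()
html_only.set_content("<p>Big sale!</p>", subtype="html")

# A typical personal message with plain text plus an HTML alternative.
mixed = EmailMessage()
mixed.set_content("Hi, see you at 3pm.")
mixed.add_alternative("<p>Hi, see you at 3pm.</p>", subtype="html")
```

On its own this flag is only a hint, which is why it makes sense as one feature among several rather than as a rule.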

The blogs for the spam detector are below:

https://spatialshadow.tumblr.com/post/186580180264/spam-detector-blog-post-13-from-20172019

https://spatialshadow.tumblr.com/post/186556521474/spam-detector-blog-post-12-from-19072019

https://spatialshadow.tumblr.com/post/186555951469/spam-detector-blog-post-11-for-1772019

https://spatialshadow.tumblr.com/post/186346780679/spam-detector-blog-post-10-for-15072019

https://spatialshadow.tumblr.com/post/186346549089/spam-detector-blog-post-8-for-the-weekend-of

https://spatialshadow.tumblr.com/post/186345468584/spam-detector-blog-post-7-for-the-weekend-of

https://spatialshadow.tumblr.com/post/186344917889/spam-detector-blog-post-5-for-the-weekend-of

https://spatialshadow.tumblr.com/post/186344955689/spam-detector-blog-post-6-for-the-weekend-of

https://spatialshadow.tumblr.com/post/186231812344/spam-detector-blog-post-4-1272019

https://spatialshadow.tumblr.com/post/186185699149/spam-detector-blog-post-3-1072019

https://spatialshadow.tumblr.com/post/186181417654/spam-detector-blog-post-2

https://spatialshadow.tumblr.com/post/185904191684/spam-detector-blog-post-1

https://spatialshadow.tumblr.com/post/185620710789/something-awesome-project-proposal


Community and Professionalism

Once again, I have broken this section down into its key parts.

Collaborating on the Course environment

I posted my working for past papers on my public blog so that other students could access it. It can be found at the following blog links: <https://spatialshadow.tumblr.com/post/186180061179/sample-midterm-exam-2015 > < https://spatialshadow.tumblr.com/post/186180023654/sample-midterm-exam-2016 >

< https://spatialshadow.tumblr.com/post/186180002474/sample-exam-2017-mid-sem-exam >

I also wrote blog posts summarising various components of the course and posted them on the public blog so that they would be publicly available. They can be found at the following blog links: < https://spatialshadow.tumblr.com/post/186179571744/hash-function-attacks >

< https://spatialshadow.tumblr.com/post/186179524249/properties-of-cryptographic-hash-functions >

< https://spatialshadow.tumblr.com/post/186179511379/hashing-and-cryptographic-hashes >

< https://spatialshadow.tumblr.com/post/186179287514/bits-of-security >

< https://spatialshadow.tumblr.com/post/186179166154/rsa >

I posted Security Everywhere posts, although they are mostly on Tumblr, as I forgot about Twitter and then forgot my password.

Here are all the security everywhere posts

< https://spatialshadow.tumblr.com/post/185465794284/scam-email-convincing-or-not > - spam email

< https://spatialshadow.tumblr.com/post/185465804999/scam-email-not-much-information-this-time-but > - spam email

< https://spatialshadow.tumblr.com/post/185465941314/scam-email-need-i-say-more > - spam email

< https://spatialshadow.tumblr.com/post/185463550074/inventive-way-to-combat-scammers > - this is a post related to security that I thought was interesting and shared

<https://spatialshadow.tumblr.com/post/186601831939/security-everywhere-side-channels > - I found side channels in what I posted on Tumblr

< https://spatialshadow.tumblr.com/post/186601924194/security-everywhere-i-forgot-my-twitter-password > - forgot password

Teamwork, communication and feedback

I participated in the tutorial group as we came up with ideas for case study questions in small groups

< https://spatialshadowblog.tumblr.com/post/186205899423/tw6>

< https://spatialshadowblog.tumblr.com/post/186179135933/tutorial-week-5 >

< https://spatialshadowblog.tumblr.com/post/185921145618/tutorial-week-4 >

< https://spatialshadowblog.tumblr.com/post/185626404898/reflection-on-tutorial-week-2 >

< https://spatialshadowblog.tumblr.com/post/185465184008/notes-on-tutorial-w1 >

A few people from the tutorial group created a group chat so that we could collaborate and communicate about the course. I joined it and actively participated in asking and answering questions. https://spatialshadowblog.tumblr.com/post/186601343923/group-chat-evidence

Responsible data handling

While completing the Something Awesome project I was dealing with and analyzing real emails. Due to the amount of information in the emails, I had to handle them carefully and ensure I didn’t disclose that information. I kept the emails on my computer and only put on GitHub the information that I was able to disclose. This can be seen in the repository: https://github.com/Derje/SpamDetectorBot

I also did not post information from the emails to my blogs for the same reason, and in any instance where I did post information from the emails I made sure to remove any sensitive information from the post, including only what was absolutely necessary. Here is an example < https://spatialshadow.tumblr.com/post/186231812344/spam-detector-blog-post-4-1272019 >

I put all posts related to lectures and tutorials on my private blog, just in case my notes included some information that was not meant to be disclosed. Evidence for this is the blog that the lecture and tutorial notes are in, e.g. < https://spatialshadowblog.tumblr.com/post/185465184008/notes-on-tutorial-w1 >. Note the spatialshadowblog as opposed to the spatialshadow of other posts.

Another example of my ability to handle sensitive data is where I blocked out the sender and receiver information in the phishing email samples I posted. Examples of this can be seen here.

< https://spatialshadow.tumblr.com/post/185465794284/scam-email-convincing-or-not >

< https://spatialshadow.tumblr.com/post/185465804999/scam-email-not-much-information-this-time-but >

< https://spatialshadow.tumblr.com/post/185465941314/scam-email-need-i-say-more >


Skills

I have improved my security skills quite a lot throughout the course (from what was probably essentially zero). I have broken this section down into the sections specified in the job application.

Technical ability

For my Something Awesome project I created a spam detector. This included writing functions to connect to email clients, reading and interpreting email messages, and writing ML functions. Some examples include:

https://spatialshadow.tumblr.com/post/186181417654/spam-detector-blog-post-2 - I talk about setting up the connection to the server

https://spatialshadow.tumblr.com/post/185904191684/spam-detector-blog-post-1 - I give the link to the GitHub repository for the project and talk about the formatting code for reading emails.

https://spatialshadow.tumblr.com/post/186346549089/spam-detector-blog-post-8-for-the-weekend-of - I show some output of the program as it is run and talk a bit about ML

I created a hashing script for one of the modules' weekly work to solve the problem more easily. The script implements the hashing function discussed in the module, allowing me to solve the problem without working it out by hand. I talk about it here https://spatialshadow.tumblr.com/post/186179511379/hashing-and-cryptographic-hashes

I wrote a program in C that was vulnerable to format string attacks and tested attacking that program to display items from the text. There is a short post as evidence here https://spatialshadow.tumblr.com/post/186600557279/formatting-strings-and-buffer-overflow

I also completed the cookie challenge from the CTFs

https://spatialshadow.tumblr.com/post/186600673714/cookie-challenge-proof

I went to the reverse engineering workshop; you can see the blog post here. It was really fun to learn about this and I got further through than I actually thought I would.

https://spatialshadowblog.tumblr.com/post/185626555253/reverse-engineering-workshop

Practical skills and Research

I made a few posts where I explained and summarized various components of the course in simple terms so that the concepts could be understood by non-security people. This also involved research. Some examples are: <https://spatialshadow.tumblr.com/post/186180061179/sample-midterm-exam-2015 > - working and my answers for a previous mid-semester exam.

< https://spatialshadow.tumblr.com/post/186180023654/sample-midterm-exam-2016 > - working and my answers for a previous mid-semester exam.

< https://spatialshadow.tumblr.com/post/186180002474/sample-exam-2017-mid-sem-exam > - working and answers for a previous mid-semester exam. This one I think is particularly interesting, as I discovered that spell checkers could be used for cracking ciphers

< https://spatialshadow.tumblr.com/post/186179571744/hash-function-attacks > - details of hash functions

< https://spatialshadow.tumblr.com/post/186179524249/properties-of-cryptographic-hash-functions > - summary of properties

< https://spatialshadow.tumblr.com/post/186179166154/rsa > - research on RSA

I also cracked quite a few substitution ciphers over the course of this subject. I actually find this really hard to do … it is difficult to think of a word in terms of the way it is spelt

<https://spatialshadowblog.tumblr.com/post/185468367958/did-some-practice-cryptography-decryption-using>


Time Management

During this trimester (T2 2019) I had to manage my time effectively between all of my courses, as I had many competing deadlines throughout the semester (particularly because I took 3 content-heavy subjects with high workloads). I mostly stuck to the convention of working on security on Wednesday, Saturday and Sunday (and sometimes Friday), as can be seen from the post dates of my blogs.

I showed my time management skills by setting deadlines for tasks within my Something Awesome project and successfully meeting them. This can be seen in the following example blog posts:

https://spatialshadow.tumblr.com/post/186181417654/spam-detector-blog-post-2 - this blog sets EOD goals.

https://spatialshadow.tumblr.com/post/186185699149/spam-detector-blog-post-3-1072019 - this blog is the EOD blog for the previous link and sets an EOD deadline for the Friday

https://spatialshadow.tumblr.com/post/186231812344/spam-detector-blog-post-4-1272019 - this shows that I met that deadline

https://spatialshadow.tumblr.com/post/186346780679/spam-detector-blog-post-10-for-15072019 - in this post I mention that I am on track with my plan

I went to almost all the security lectures every week (the exception being week 4), writing up lecture notes during the lectures and posting them on my private blog. The links are as follows.

https://spatialshadowblog.tumblr.com/post/186488494453/l2w8

https://spatialshadowblog.tumblr.com/post/186481938713/l1w8

https://spatialshadowblog.tumblr.com/post/186325302908/l2w7

https://spatialshadowblog.tumblr.com/post/186318162048/l1w7

https://spatialshadowblog.tumblr.com/post/186179111798/l2w6

https://spatialshadowblog.tumblr.com/post/186154041678/l1w6

https://spatialshadowblog.tumblr.com/post/185998315383/l2w5

https://spatialshadowblog.tumblr.com/post/185991111483/l1w5

https://spatialshadowblog.tumblr.com/post/185921587368/l2w3

https://spatialshadowblog.tumblr.com/post/185674158778/before-the-lecture-look-at-security-soc-events - this is L1W3

https://spatialshadowblog.tumblr.com/post/185535785283/lecture-2-week-2-security-notes

https://spatialshadowblog.tumblr.com/post/185514099748/security-lecture-1-week-2-notes-from-lecture

https://spatialshadowblog.tumblr.com/post/185464946813/notes-security-lecture-l2w1

https://spatialshadowblog.tumblr.com/post/185464351993/notes-security-lecture-l1w1

I was sick during week 4 and as a result I did not go to the lectures and was not blogging during that week. This also meant that in the following week I had to play catch-up for this and other subjects. I had to prioritize my work, and getting the work done was a higher priority for me than blogging about it. I blogged about it here < https://spatialshadowblog.tumblr.com/post/186179930188/not-been-blogging >. Throughout weeks 4 and 5 I was still consistently working on the subject; the blogs showing the work I did during those weeks can be found here

<https://spatialshadow.tumblr.com/post/186180061179/sample-midterm-exam-2015 >

< https://spatialshadow.tumblr.com/post/186180023654/sample-midterm-exam-2016 >

< https://spatialshadow.tumblr.com/post/186180002474/sample-exam-2017-mid-sem-exam >

< https://spatialshadow.tumblr.com/post/186179571744/hash-function-attacks>

< https://spatialshadow.tumblr.com/post/186179524249/properties-of-cryptographic-hash-functions >

< https://spatialshadow.tumblr.com/post/186179511379/hashing-and-cryptographic-hashes >

< https://spatialshadow.tumblr.com/post/186179287514/bits-of-security >

< https://spatialshadow.tumblr.com/post/186179166154/rsa >

I went to all tutorials, writing a blog post for all except week 3 (which I can't find for some reason …). I also made a blog post for the case studies before the tutorial for most weeks where applicable (once again, the exceptions being weeks 7 and 8 ...)

< https://spatialshadow.tumblr.com/post/186178813624/case-studies-week-6-safer >

< https://spatialshadow.tumblr.com/post/185921075899/case-study-question-that-i-did-in-class >

< https://spatialshadowblog.tumblr.com/post/186205899423/tw6>

< https://spatialshadowblog.tumblr.com/post/186179135933/tutorial-week-5 >

< https://spatialshadowblog.tumblr.com/post/185921145618/tutorial-week-4 >

< https://spatialshadowblog.tumblr.com/post/185626404898/reflection-on-tutorial-week-2 >

< https://spatialshadowblog.tumblr.com/post/185718792713/doors-on-planes-case-study >

< https://spatialshadowblog.tumblr.com/post/185554277688/security-engineer-harry-houdini-analysis >

< https://spatialshadowblog.tumblr.com/post/185465184008/notes-on-tutorial-w1 >

< https://spatialshadowblog.tumblr.com/post/185379709763/deepwater-horizon-disaster-case-studies-analysis >

< https://spatialshadowblog.tumblr.com/post/186597994223/tutorial-w8 >

<https://spatialshadowblog.tumblr.com/post/186600245388/tw7>


Analytical Ability

I have shown my ability to think analytically in various aspects of the course. I have broken it down into the 3 aspects mentioned in the specifications section to make it clear.

Research

I conducted significant research and analysis of spam emails, existing spam detectors and articles while completing my Something Awesome project. Although almost all the Something Awesome blog posts contain some research, the following is a brief description of the research I conducted during the project, with some example blog posts that provide supporting evidence.

I conducted research into existing spam detectors, particularly the popular and successful SpamAssassin, which tends to be used on the servers of email providers < https://spatialshadow.tumblr.com/post/185904191684/spam-detector-blog-post-1 >

I analyzed a number of emails in order to determine which features of the emails I could use to determine whether they are spam. However, due to the amount of information contained within the emails, I decided against posting the images in case it breached the good faith policy of the course, but I did analyze the sections and blog about it. The following post contains my research and analysis into emails, particularly focused on the content of advertising emails. In doing this I found quite a few features that advertising spam had in common (like having only an HTML component and a high number of images) and used this to come up with ideas for how these features could be used to detect spam

<https://spatialshadow.tumblr.com/post/186185699149/spam-detector-blog-post-3-1072019 >

The next link contains more research and analysis into email content to work out which features to look at for classification, as well as an example of extending my research on an encoding I found in email content to determine whether it could be used as an indicator of spam < https://spatialshadow.tumblr.com/post/186231812344/spam-detector-blog-post-4-1272019 >

I also conducted a lot of research into how to actually accomplish the project. This includes things such as how to connect to the email server, what type of ML algorithms to use, and ways in which they could be implemented. The following are some examples of such research <https://spatialshadow.tumblr.com/post/186344955689/spam-detector-blog-post-6-for-the-weekend-of> < https://spatialshadow.tumblr.com/post/186181417654/spam-detector-blog-post-2 >

I also conducted research into various concepts covered in the course in order to gain a more detailed understanding of concepts introduced in lectures and content covered in various questions. Some examples of this can be found here <https://spatialshadow.tumblr.com/post/186179524249/properties-of-cryptographic-hash-functions> (summary of research)

<https://spatialshadow.tumblr.com/post/186179571744/hash-function-attacks > (summary of research overview )

<https://spatialshadow.tumblr.com/post/186180023654/sample-midterm-exam-2016> (research into answering question )

< https://spatialshadow.tumblr.com/post/186180002474/sample-exam-2017-mid-sem-exam > (contains an observation about the usefulness of Google Docs, or any document editor with spell check, when cracking a transposition cipher)

Reflections

During the Something Awesome project I reflected on all the research and observations I made during construction and testing, and then changed my approach to the task at hand. The evidence for Something Awesome in the previous section also provides evidence of this. Some examples of reflection from observation are < https://spatialshadow.tumblr.com/post/186580180264/spam-detector-blog-post-13-from-20172019 > < https://spatialshadow.tumblr.com/post/186231812344/spam-detector-blog-post-4-1272019 >

For the weekly case studies I read through the readings and reflected on that information in order to answer the case study questions and write blog posts on what I learned from the readings. Some examples are < https://spatialshadowblog.tumblr.com/post/185718792713/doors-on-planes-case-study >

< https://spatialshadowblog.tumblr.com/post/185554277688/security-engineer-harry-houdini-analysis >

< https://spatialshadowblog.tumblr.com/post/185379709763/deepwater-horizon-disaster-case-studies-analysis >

< https://spatialshadow.tumblr.com/post/186178813624/case-studies-week-6-safer >

< https://spatialshadow.tumblr.com/post/185921075899/case-study-question-that-i-did-in-class >

Application

As already shown above, I applied my analysis of information when creating my spam bot for the Something Awesome project, and I applied my knowledge and analytical skills when completing the case studies, as shown in the above examples.

The reverse engineering workshop is another example, where I applied the security knowledge I had gained in the workshop to get into the application. < https://spatialshadowblog.tumblr.com/post/185626555253/reverse-engineering-workshop >

Another example of applying theory to a real-world problem was my experiment with format strings < https://spatialshadow.tumblr.com/post/186600557279/formatting-strings-and-buffer-overflow >

Yet another example is using the information about cookies, and how websites use them to identify users through a value, to crack the cookie CTF challenge. This also required research into how to edit the value of a cookie in the browser. <https://spatialshadow.tumblr.com/post/186600673714/cookie-challenge-proof>


Security Everywhere - i forgot my twitter password

So, from learning about how insecure passwords can be and that you should not use words or things that can be related to you, I tried using a random-character password for Twitter... unfortunately I have now forgotten the password.


Security Everywhere - side channels

I was looking at the pictures I put in this blog and realized that they had side channels. You can see stuff through the reflections!!!!


Cookie challenge proof


Formatting Strings and Buffer Overflow.

So in the break in the week 7 lecture where we talked about format string attacks, I decided to write a quick program that would be vulnerable to format string attacks to see how it works. That went well and was quite interesting to see. However, when testing the quick program I discovered that in my haste to write it I had evidently also made it vulnerable to buffer overflow attacks, and that GCC has a protection mechanism to detect buffer overflows when compiled by default (https://stackoverflow.com/questions/1345670/stack-smashing-detected)

The following is the code and the test

#include <stdio.h>

int main(int argc, char *argv[]) {
    char name[100];
    printf("Enter your name: ");
    scanf("%s", name);  /* unbounded read: buffer overflow */
    printf(name);       /* input as format string: the vulnerability */
    return 0;
}

And some evidence that I did it and when it was done


Spam Detector Blog Post 13 (From 20/07/2019)

OK, so after a little investigating in a few different email accounts that I got permission to access, I found that the spam emails received appear to differ between users. Additionally, there is no set definition of what is considered a spam email. The detector works by determining features that are common amongst the spam emails and then classifying emails based on those features. As such, it is important for spam emails to be distinguishable for this to work. Existing spam detectors such as SpamAssassin use many different factors to determine whether an email is spam. Some of those methods include checking the email address and ISP against ones known to send spam. I am unable to implement these features as I simply do not have access to such data, so I can only use the information contained within the email itself. So in order to have my bot detect “spam”, I needed to specify precisely what I was considering spam, which was done by moving those emails to the spam folder of the email account.

So now I will specify the types of spam (and ham) that I am considering.

SPAM:

Advertising

Promotion/Giveaway

Phishing

HAM:

Articles

Communication emails

Forms

So basically anything that was attempting to “sell” something is classified as spam.

This actually made the detector function better, as it was able to clearly distinguish between the types of emails. Now I know there are some instances where people subscribe to advertising and thus do not consider it spam. In these cases it would be better to simply add the sender's email address to an exception list, so that an email is not run through the detector if it is sent from an address in that list. Below are the final results for the classifier; as you can see it is performing well. It is worth noting that training it on another email account with a different separation would result in new connections being formed and thus a different classification being learnt.

This image is of the network run on the whole test email account. In this account there were around 22 new emails (12 spam and 10 ham). Note that of the 18 new emails only one was classified incorrectly.

Note that the first result in the second image is the program run on the sample data that was extracted from another email account, and as such it does not actually conform to the types specified above when the network is trained on the test email. However, it does provide evidence of the network's performance on a different account. As spam emails tend to be individualized for different accounts, the network needs to be trained on the account it is intended to be used on, and the data in that account needs to be “clean” (classified as spam for a reason other than “I did not subscribe to this”). The second run displayed is actually run on the same data it was trained on, so it does not provide much useful information.
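The exception list idea mentioned in this post was a suggestion rather than something described as implemented, so the following is only a sketch of how it might look. The function names and the example addresses are hypothetical; the only library call used is parseaddr from the standard email.utils module.

```python
from email.utils import parseaddr


def sender_address(from_header):
    # Pull the bare address out of a From header such as "Shop <deals@example.com>".
    return parseaddr(from_header)[1].lower()


def should_classify(from_header, exceptions):
    # Skip the detector entirely for senders the user has chosen to trust.
    return sender_address(from_header) not in exceptions


# Hypothetical exception list, stored lower-case for case-insensitive matching.
exceptions = {"deals@example.com"}
```

Note that the From header is easy to forge, so a list like this is a convenience for wanted advertising rather than a security control.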


Spam Detector Blog Post 12 (From 19/07/2019)

OK, so today I created the decision tree classifier, separated the classes into separate files, and added testing, training and running functionality to the system.

The decision tree takes in a list of features, which it uses to separate the emails into their respective classes. The feature array I send into the tree is: <classification score, #img, #links, components, utf8 in subj, length of email text>, for reasons previously discussed. The revised complete design diagram is included in this post
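As a rough illustration of how a feature array like the one above could be built from a raw email, here is a sketch using only the standard library. It is not the project's code, and several details are assumptions: "components" is taken to mean the number of non-container MIME parts, "utf8 in subj" is taken to mean the raw Subject header contains a "=?utf-8?" encoded word, and the text length is the combined length of the plain-text and HTML bodies. The classification score is assumed to come from the separate text classifier and is simply passed in.

```python
import re
from email import message_from_string

IMG_RE = re.compile(r"<img\b", re.IGNORECASE)
LINK_RE = re.compile(r"<a\b[^>]*\bhref=", re.IGNORECASE)


def extract_features(raw_email, text_score):
    msg = message_from_string(raw_email)
    leaves = [p for p in msg.walk() if not p.is_multipart()]
    html, text = "", ""
    for part in leaves:
        payload = part.get_payload(decode=True) or b""
        try:
            body = payload.decode(part.get_content_charset() or "utf-8",
                                  errors="replace")
        except LookupError:
            # Unknown charset name in the header: fall back to UTF-8.
            body = payload.decode("utf-8", errors="replace")
        if part.get_content_type() == "text/html":
            html += body
        elif part.get_content_type() == "text/plain":
            text += body
    subject = str(msg.get("Subject", ""))
    return [
        text_score,                           # score from the text classifier
        len(IMG_RE.findall(html)),            # number of <img> tags
        len(LINK_RE.findall(html)),           # number of <a href=...> links
        len(leaves),                          # number of MIME components
        int("=?utf-8?" in subject.lower()),   # UTF-8 encoded word in subject
        len(text) + len(html),                # length of the email text
    ]


sample = (
    "From: a@example.com\n"
    "Subject: =?utf-8?B?U2FsZQ==?=\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><img src='a.png'><a href='http://x.example'>buy</a>"
    "<IMG src='b.png'></html>\n"
)
features = extract_features(sample, 0.9)
```

A list in this shape is what a decision tree implementation would typically expect as one training row.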

The program can also now be run in a testing mode, which assumes that the emails are already separated into their respective folders (inbox=ham, spam folder=spam) and checks how accurate the classification is, as well as a training mode, which assumes the same thing and is used to train.
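The accuracy check in testing mode comes down to comparing the folder each email sits in with the class the detector predicts. A minimal sketch of that comparison is below; the function names and the "spam"/"ham" label strings are assumptions made for illustration.

```python
def confusion_matrix(actual, predicted):
    # Counts with "spam" treated as the positive class.
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == "spam" and p == "spam":
            counts["tp"] += 1
        elif a == "ham" and p == "spam":
            counts["fp"] += 1
        elif a == "spam" and p == "ham":
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# actual = folder the email is in, predicted = what the detector said
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
result = confusion_matrix(actual, predicted)
```

Reporting the four counts alongside the accuracy shows whether errors are mostly missed spam or ham wrongly flagged, which a single percentage hides.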

I ran the program and the performance was not great (see the image). After analyzing the data I found that this is probably because I had not clearly defined what I considered to be spam. I was treating as spam any email from a sender I didn't remember signing the account up for. It just so happens that there was no clear difference in the emails themselves between what I was considering spam and not spam (other than where they were sent from). So the next step is to better clarify precisely what I am considering spam and not spam.


Spam Detector Blog Post 11 (For 17/7/2019 )

Today I mainly worked on obtaining the data for training and getting the remainder of the model set up to perform classification based on the remaining parts of the email (images, HTML, etc.). In a previous post I mentioned the idea of using data from <https://appliedmachinelearning.blog/2017/01/23/email-spam-filter-python-scikit-learn/> for training the rest of the classifier. However, after taking a closer look at the data I realized that each sample is not an entire email and does not contain the sections of the email that are needed to train the remaining part of the classifier. In order to “train” the decision tree, I need sample spam and ham emails of as many forms as possible, to allow the ML algorithm to learn the correct relationships between data and classification. If the data does not span all types of email, there is the possibility of the ML algorithm not learning relationships, or even learning incorrect ones. I was unable to find a complete dataset and, after thinking about it, I don't think I will be able to find one, due to the amount and type of information emails tend to contain. I have been trying to get spam emails into the test account but have not been very successful.

So I will get as many spam and ham emails as possible within the allotted time and train on that, but that likely means I will not have a lot of data for testing. I collected a few spam and ham emails from other accounts (20 each) to be used for training. I also started working on writing the spam detector class and the decision tree; I am going to need to break the file up into separate files as it is getting a bit big to work with.


Spam Detector Blog Post 10 (for 15/07/2019)

Today most of my time was taken up by finding and fixing a bug where emails were being moved before all the UIDs had been looked up, causing the program to lose track of emails and crash, so not much was done today.
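One way to avoid that kind of bug is to split the work into two phases: look up every UID first, and only start moving messages once the lookup is finished. The sketch below is an illustration, not the project's actual fix. It assumes conn is an already logged-in imaplib.IMAP4_SSL connection with the inbox selected, that is_spam is a caller-supplied function taking a UID, and that the spam folder is named "[Gmail]/Spam" (the usual name on English-language Gmail accounts).

```python
def move_spam(conn, is_spam, spam_folder="[Gmail]/Spam"):
    # Phase 1: look up and classify every UID before touching the mailbox.
    typ, data = conn.uid("SEARCH", None, "ALL")
    uids = data[0].split() if typ == "OK" and data and data[0] else []
    spam_uids = [uid for uid in uids if is_spam(uid)]

    # Phase 2: only now move messages, so no lookup runs on a changing mailbox.
    for uid in spam_uids:
        conn.uid("COPY", uid, spam_folder)
        conn.uid("STORE", uid, "+FLAGS", "(\\Deleted)")
    if spam_uids:
        conn.expunge()
    return spam_uids
```

Using UIDs rather than message sequence numbers also helps here, since sequence numbers shift whenever a message is removed from the mailbox.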

Last blog post I stated that the performance on the real email account was much lower than has been initial anticipated now. it is worth noting that spam tends to be a fairly individualized thing, different people will have different types of spam with highly varied continence. as such if i have time it would be worth setting up a localized training method capable of using the data for the users email account in order to learn. This is something that I will attempt to implement in to my bot inorder to help with the correct classification on the real email account. It is also worth noting that this could be due to the data in the data set currently used being preprocessed or that the data is simply not large enough to create the correct classification. I will be exploring all of these possibilities next . the steps next to be complete are:

1 - analyze the test email data in order to separate it into spam and not spam for testing, training and evaluation.

2 - get more email data for testing that is more than just the text of the email (in other words, get test data for the full email so that it can be used to train the spam_detector feature classifier).

3 - construct the spam_detector feature classifier that will be used to finally determine whether an email is spam or not spam, and measure the final performance of the system.

And then it is done!! So yeah, so far I am on schedule to be done by the end of the week like I planned. If I can find enough data, I may take the full emails, classify them according to type of spam, and create multiple text classifiers trained to look for features of the different types of spam, as I feel like constraining the type of spam will increase the similarities it can check for and thus allow it to produce a better result. But this really depends on my ability to find a large enough sample of data for each of the different types.


spatialshadow-blog · 5 years ago


Spam Detector Blog Post 8 (for the weekend of 14/07/2019)

So I finished making the text classifier Text_classifier.

This involved 4 main steps:

1 - reading in the training data

This involved reading in the data from the data set and extracting the labels for that data.

2 - Preprocessing the data

This involved “normalising” the text of the email so that it is more comparable and thus more useful for classification, as the data space is enormous. Since the NB classifier is essentially comparing the words to a list of word probabilities for each class and assigning the class that has the most “similarities” to the text, it is clear why increasing the comparability of the text will aid classification.

As there are numerous ways of preprocessing a batch of text, the method that I used was as follows:

-1- replace all numbers with numb and all $ with dollars, in order to make any number in the email comparable to any other number and make $ comparable to the word written out.

-2- make the text all lower case so that the system is not case sensitive (i.e. Security is the same as security or seCurity).

-3- remove the “stop words” (common words like a and the that occur frequently in all documents but have no bearing on whether or not an email is spam).

-4- word stemming, that is, replacing the different forms a word takes in different contexts with a single form, e.g. the noun, verb, etc. variants of a word all get replaced with the noun. This once again makes the text more comparable by reducing the variation in the text. From my analysis of the sample data it doesn't seem as if this will have any bearing on the class (in other words, the variation of a word can’t be used to separate the data), but different spam emails use different variations of the same words, so eliminating these variations may make them appear more similar.
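The four preprocessing steps can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the stop-word list is a tiny sample, and `crude_stem` is a deliberately simple suffix stripper standing in for a proper stemmer. The names `STOP_WORDS`, `crude_stem` and `preprocess` are made up for this example.

```python
import re

# Tiny illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Step 1: make "$" and numbers comparable across emails.
    text = text.replace("$", " dollars ")
    text = re.sub(r"\d+(?:[.,]\d+)*", " numb ", text)
    # Step 2: lower-case everything so the result is not case sensitive.
    text = text.lower()
    words = re.findall(r"[a-z]+", text)
    # Step 3: drop stop words.
    words = [word for word in words if word not in STOP_WORDS]
    # Step 4: collapse word variants with the crude stemmer.
    return [crude_stem(word) for word in words]


tokens = preprocess("Win $1,000 in the Security lottery!")
print(tokens)
```

With this sketch, `Security`, `security` and `seCurity` all come out as the same token, which is the point of step 2.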

3 - Constructing the model

So I decided to mainly follow the example for this and use the existing vectoriser, as it will provide a more concise representation of the data that may in fact be more comparable. I also decided to put the model within a pipeline so as to keep it together and make it easier to store. I also decided to implement persistence for the model, as training can take a long time and we don't want the model to retrain every time we run the program. Additionally, I am currently exploring the idea that the emails from the user's inbox could themselves be used for training, in which case it would be crucial for the model to have persistence and learn over time (but I will elaborate more on this in a later blog).
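A vectoriser and NB classifier kept together in a pipeline, with persistence, might look something like the sketch below. It assumes scikit-learn's `CountVectorizer` and `MultinomialNB` (the post does not say exactly which vectoriser was used) and uses `pickle` for storage. The function names, the file name and the four training sentences are all made up for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_model():
    # Vectoriser and classifier travel together, so one object gets stored.
    return Pipeline([
        ("vectoriser", CountVectorizer()),
        ("classifier", MultinomialNB()),
    ])


def load_or_train(path, texts, labels):
    # Reuse a stored model when there is one; otherwise train and store it.
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    model = build_model().fit(texts, labels)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return model


# Made-up toy data, only to show the calls.
texts = [
    "free prize win money now",
    "claim your free money",
    "meeting agenda for monday",
    "lunch with the project team",
]
labels = ["spam", "spam", "ham", "ham"]

model_path = os.path.join(tempfile.mkdtemp(), "text_classifier.pkl")
model = load_or_train(model_path, texts, labels)
prediction = model.predict(["free money prize"])[0]
print(prediction)
```

Calling `load_or_train` a second time with the same path skips training and loads the stored pipeline, which is the behaviour persistence is meant to give.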

4 - Training and testing the model

I also set up the infrastructure to train the model and test its performance. I set the training up to only be done if it can't find an existing model, and made a function to test the model's performance against the dataset's test set. The performance of the model on the test set is shown in the image below; as you can see, it performed remarkably well, accurately classifying the emails. However, after running the uncompleted bot (classifying on only the text) on the test email account, I found that it classified almost every email as spam, even some that clearly were not. I have yet to go through the emails to determine whether or not they are all spam, but the performance seems highly suspicious and requires investigation.
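A test function of this kind can be sketched as a confusion-matrix count. This is only an illustration under assumptions: `evaluate` and `always_spam` are hypothetical names, the four labelled examples are made up, and `always_spam` is a dummy that mimics a classifier flagging everything as spam.

```python
from collections import Counter


def evaluate(predict, texts, labels, positive="spam"):
    """Count true/false positives/negatives for a predict(text) function."""
    counts = Counter()
    for text, actual in zip(texts, labels):
        predicted = predict(text)
        if actual == positive:
            counts["tp" if predicted == positive else "fn"] += 1
        else:
            counts["fp" if predicted == positive else "tn"] += 1
    total = sum(counts.values())
    accuracy = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return {
        "tp": counts["tp"],
        "fp": counts["fp"],
        "fn": counts["fn"],
        "tn": counts["tn"],
        "accuracy": accuracy,
    }


def always_spam(text):
    # Dummy classifier that labels every email as spam.
    return "spam"


report = evaluate(
    always_spam,
    ["free money", "team lunch", "project notes", "win a prize"],
    ["spam", "ham", "ham", "spam"],
)
print(report)
```

A classifier that marks nearly everything as spam shows up here as a high false-positive count, which is easier to spot in the matrix than in a single accuracy figure.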


spatialshadow-blog · 5 years ago


Spam Detector Blog Post 7 (for the weekend of 14/07/2019)

After conducting the previously mentioned research I formalized the rest of the design for the class structure of the new system. Realising now that I never actually posted this design, I thought I might just do this now and explain it.

It contains 4 main parts:

Imbox_Handler : this is the module that connects to the email server and handles moving the data out of and around the email client. -> done

Email_data : this is the class used to store the information of a particular email. This includes getter functions to get the various elements of the email, functions to process the data from the email and functions that perform tests on the email itself. -> done

Text_Classifier : this is the NB classifier that takes a corpus of text and uses it to determine whether the text is spam or not. -> This is the bit that I am currently working on.

Spam_Detector : this is the base class where everything is handled from. It will contain another ML classifier (most likely a decision tree, as mentioned in the previous posts) which will take the various tests from the Email data, its own tests and the result from the text_classifier to determine if a given email is spam or not. If it is, then this module will also use the inbox handler to move the email to the spam folder of the email client.
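How the four parts might fit together can be sketched as a skeleton. The class names come from the design above, but every method, field and threshold here is a hypothetical placeholder: the text classifier is a dummy keyword check, the handler only records moves instead of talking to a mail server, and a simple rule stands in where the design calls for a trained decision tree.

```python
from dataclasses import dataclass


@dataclass
class Email_data:
    uid: int
    subject: str
    body: str

    def text(self):
        return f"{self.subject} {self.body}"

    def link_count(self):
        # Hypothetical example of a "test on the email itself".
        return self.body.lower().count("http")


class Text_Classifier:
    """Placeholder: the real class would wrap the trained NB model."""

    def spam_probability(self, text):
        return 0.9 if "free" in text.lower() else 0.1


class Imbox_Handler:
    """Placeholder: records moves instead of contacting a mail server."""

    def __init__(self):
        self.moved_to_spam = []

    def move_to_spam(self, uid):
        self.moved_to_spam.append(uid)


class Spam_Detector:
    def __init__(self, text_classifier, inbox_handler):
        self.text_classifier = text_classifier
        self.inbox_handler = inbox_handler

    def is_spam(self, email):
        # Stand-in rule; the design calls for a trained decision tree here.
        text_score = self.text_classifier.spam_probability(email.text())
        return text_score > 0.5 or email.link_count() > 3

    def process(self, emails):
        for email in emails:
            if self.is_spam(email):
                self.inbox_handler.move_to_spam(email.uid)


handler = Imbox_Handler()
detector = Spam_Detector(Text_Classifier(), handler)
detector.process([
    Email_data(1, "Free offer", "click http://example.com"),
    Email_data(2, "Minutes", "see attached"),
])
print(handler.moved_to_spam)
```

Passing the classifier and handler into `Spam_Detector` keeps each part replaceable, so the dummy pieces can be swapped for the real ones without changing the detector.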


spatialshadow-blog · 5 years ago


Spam Detector Blog Post 5 (for the weekend of 13/7/2019)

So last weekend I was working on setting up the text classifier in order to work out the probability of something being spam based solely on the text. The words used within an email can provide a good indication of whether it is spam, as some words (for example, free) occur more often in spam emails than in non-spam emails. Using a form of machine learning classification for this task will allow the program to determine for itself which words provide the best indication.

I found that the Naive Bayesian classifier will be reasonably good for this task, as it determines the class based on the probability of the words in the email occurring in each class. The only downside is that it requires a lot of data in order to accurately determine the probabilities. Due to the large amount of data required I downloaded a dataset of spam and non-spam (referred to as ham) emails to be used by the NB classifier for training, as it seems unlikely that I will be able to acquire enough spam and ham emails myself within the time frame specified by this project (funnily enough, when I want to get spam, I can't seem to get much). As a consequence the text classifier of the spam bot will need to classify spam generally, as I am unable to find sample data for spam emails of specific types.
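To make the idea concrete, here is a minimal from-scratch sketch of a Naive Bayes text classifier using only the standard library. It is not the classifier used in the project: `train_nb` and `classify_nb` are hypothetical names, the four training examples are made up, and add-one smoothing is an assumption added so that unseen words do not zero out a class.

```python
import math
from collections import Counter


def train_nb(documents):
    """documents: list of (list_of_words, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    vocabulary = set()
    for words, label in documents:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify_nb(model, words):
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Log prior for the class, then add the log likelihood of each word.
        score = math.log(doc_counts[label] / total_docs)
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data, only to show the calls.
training = [
    ("free prize now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting notes attached".split(), "ham"),
    ("lunch on friday".split(), "ham"),
]
nb_model = train_nb(training)
print(classify_nb(nb_model, ["free", "offer"]))
print(classify_nb(nb_model, ["meeting", "friday"]))
```

With only a handful of examples the word probabilities are very rough, which is why the classifier needs a large dataset to estimate them accurately.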
