Now this may not seem like a massive change in format, but boy howdy do I feel proud of myself for writing the functions to automate this ehehehehehe (unfortunately for me the AO3 ship stats OP used a different format for the pre-2020 tables, so I'll have to write another function to sort those ones out too ToT)
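For the curious, here's a minimal sketch of the kind of reformatting function involved. The input line format, the column names, and the regex are illustrative assumptions of mine, not the actual AO3 ship stats layout:

import csv
import re

def reformat_ranking_text(raw_text, out_path):
    # Hypothetical input: lines like "12. Character A/Character B - 3456 works"
    # (the real copied-over format differs; this only sketches the approach)
    pattern = re.compile(r"^(\d+)\.\s+(.+?)\s+-\s+(\d+)\s+works$")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["rank", "ship", "works"])
        for line in raw_text.splitlines():
            match = pattern.match(line.strip())
            if match:
                writer.writerow(match.groups())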
#coding #ship stats #python #csv #ao3 ship stats #I'm gonna visualise their annual ship rankings data! #(and likely the tumblr fandom ones too) #but first I gotta clean the source data & build a data base #run all the raw data sets through my code #and then I will have a huge updated more uniform and complete data set #which I can then learn how to visualise #for data portfolio purposes #translation for non-coders: wow code is fun but it looks unimpressive if you don't know #I basically took the base text I copied off the ao3 ship rankings posts on ao3 #and wrote a bunch of code that automatically formats it from a to b #into a format that's easier to work with in my code #to be able to put it into a proper database later
Version 300
youtube | windows: zip, exe | os x: app, tar.gz | linux: tar.gz | source: tar.gz
🎉 Merry v300! 🎉
I had a great week adding interesting new gui stuff.
system:known url
You can now search for files that have or lack certain known URLs! It comes under a new system predicate, 'system:known url', and supports multiple search types--exact url, domain, regex, or from a dropdown of the client's current url classes. I am really pleased with it.
The db URL search code is currently slow--maybe ten seconds on a big client with several hundred thousand stored URLs--but I have some ideas on how to speed it up.
autocomplete favourites
I have prototyped a way to select some favourite tags from the regular tag autocomplete interface. The 'results' list is now tucked into a page in a small notebook, and I've added a second 'favourites' list beside it. You can edit these favourites under options->tags and quickly switch between the results and favourites lists by hitting left or right arrow on an empty text input. If the favourites list is selected, it will capture arrow up/down and page up/down and enter key presses just like the results list, so it should be entirely possible to now enter a favourite tag just by pressing something like 'right, down, down, enter' on a fresh search page.
In the next week or so, I will add 'add to/remove from favourites' menu items for regular tag right-click menus (for quicker favourite editing), and some other little bells and whistles (like parent support in manage tags).
If this layout and workflow works out, I expect to do more here in future. I could add lists for popular tags, system predicates like 'size<8MB', and even whole named searches like 'short inbox videos' or 'last week's gifs'.
drag export from media viewer
I have added a new icon to the top hover window of the media viewer--a green arrow pointing right, added to the top-right of the window--that, if dragged from (i.e. just click the button and then drag out of it), will start a file export drag for the current file, the same as if you had dragged the thumbnail. So, if you are in the media viewer and want to upload the file somewhere via your web browser, you can now do it real quick.
import/export siblings and parents
This is for advanced users and may be buggy.
You can now import and export tag pairs to and from the tag siblings and parents dialogs! You can do it via clipboard or .txt files, with the format being a flat newline-separated taglist (so the pairs a->b, b->c would flatten in the .txt document to the four lines a, b, b, c, and vice versa). You can also import in an 'add only' way, which will only ever add new pairs, rather than trying to petition or delete existing conflicts (as the code would do if you entered those same pairs manually in the text boxes).
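To illustrate the flat format, here's a rough sketch of the flattening in Python (my own illustration, not hydrus's actual import/export code):

# each pair (a -> b) becomes two consecutive lines in the export
def pairs_to_taglist(pairs):
    return "\n".join(tag for pair in pairs for tag in pair)

def taglist_to_pairs(text):
    lines = text.splitlines()
    return list(zip(lines[0::2], lines[1::2]))

print(pairs_to_taglist([("a", "b"), ("b", "c")]))  # four lines: a, b, b, c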
I have tested this a bunch, but the underlying logic here is still a bit of a mess, so please do some test runs before you start importing or exporting ten thousand rows and let me know if you run into any trouble.
quick and dirty duplicate processing
I've added some ugly 'just set these thumbs as alternates/same/not dupes' buttons to the duplicate processing page, designed to be used with the 'show me some random pairs' button. It is a quick and dirty way of dealing with some common groups.
The database structure here is still inefficient for larger groups (like twenty alternates of the same file), but I have spent a bit of time thinking about the next step here and may try to slide in some updates for the most common problems in the coming weeks.
full list
wrote system:known url to find files that have--or do not have--certain types of urls. it works but is still a little slow--I can optimise it later!
added exact match, domain, regex, and url class search types for system:known url
added a button to the top media viewer hover window that will start a file export drag and drop event if dragged from
moved the autocomplete dropdown results list down into a paged notebook
wrote a new 'favourites' page tab for the autocomplete dropdown results
hitting left or right arrow keys on an empty text input will move between the results tabs
hitting arrow up/down, page up/down, or home/end, or passing mouse scroll events, will now go to the currently selected page
typing regular search text into the input will automatically return the current page to the search results list
moved the 'tag suggestions' part of the 'tags' options page to a new page
added 'tag favourites' to the 'tags' options page to edit which tags show in this new tab
added import/export buttons to the tag siblings and parents dialogs. they'll export to clipboard or .txt file, and import from the same with an additional option to add_only (i.e. to not delete/petition conflicts with the existing list)
added some quick-and-dirty 'set as alternates/same/notdupes' buttons to the duplicate filter, which will quickly apply that status to the dupes and show some more dupes
sped up db loading time of tag siblings and parents significantly
added a short delay check to tag siblings/parents regeneration so rapid regenerations (such as when processing certain admin-side petitions) can be merged
fixed an issue where similar_to searches could return results not in the current file domain
fixed some spinctrls that were sizing too thin
fixed a bug in the manage server services dialog that was incorrectly dealing with port conflicts on edit service dialog ok
added a clientside and serverside assertion to test that all the services on a serverside modify services call have unique ports
fixed an issue where hydrus network services without access keys would sometimes try to sync their accounts (this was messing up some admin server setup)
fixed some misc dialog window structure
messed around a little with how the autocomplete dropdown hides and shows when in float mode--I _think_ it will now be less flickery and will otherwise position itself and receive focus better
converted the 'export files' dialog to the new sizing system and also made it non-modal (i.e. you can now interact with the rest of the program while it is open)
wrote a more rigorous force-fit-all-tlws command to the debug menu
misc fixes
misc refactoring
next week
I want to focus on the downloader overhaul, which is coming to the final big phase. Making the page of images downloader more intelligent is the first thing, and then actually getting going on the big gallery update, which we are basically ready to start.
In the run-up to v300, I've been thinking about and talking with some users about how we got here. It is odd to think that I have been hammering at this thing for six or seven years, as it seems like the time has disappeared. Making hydrus is not easy, but I get a lot out of working on it, and I really appreciate your feedback and support--thank you. I am still in an ok situation IRL, so I hope to keep on like this and just push steadily up to v350 and beyond.
Encrypted Garfield Comics Solve
http://pangenttechnologies.tumblr.com/post/158944128617
This post contains a Garfield comic strip by Jim Davis, edited only to replace the text with some type of cipher.
While players were working on this post, several more posts came through with similarly edited Garfield comic strips, five in total.
The comic strip from the first post turned out to be the debut Garfield comic strip from June 19, 1978.
To decode the text in the edited version of the comics, the first step is a Vigenere cipher, with the original comic text as the key, taking each panel separately.
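A minimal Vigenere decoder in Python, as a sketch of this step (the handling of spaces and punctuation, where the key advances only on letters, is my assumption about the solvers' approach):

def vigenere_decode(ciphertext, key):
    # standard Vigenere: shift each letter back by the matching key letter
    key_shifts = [ord(k) - ord("A") for k in key.upper() if k.isalpha()]
    out, i = [], 0
    for ch in ciphertext.upper():
        if ch.isalpha():
            shift = key_shifts[i % len(key_shifts)]
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
            i += 1
        else:
            out.append(ch)
    return "".join(out)

# panel 1, keyed with the original strip's dialogue:
print(vigenere_decode(
    "MR TQIAE MVP WWG ZKBKTQIRO ELE A TRXNOBTIQMA HIZ LBSC LB VP GZO MNLM QITQ EZNI",
    "HI THERE IM JON ARBUCKLE IM A CARTOONIST AND THIS IS MY CAT GARFIELD"))
# -> FJ AJEJA EJG IJG ...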
panel 1: MR TQIAE MVP WWG ZKBKTQIRO ELE A TRXNOBTIQMA HIZ LBSC LB VP GZO MNLM QITQ EZNI (key: "Hi, there… I'm Jon Arbuckle. I'm a cartoonist, and this is my cat, Garfield.")
panel 2: OR TPHAFI MNI AHMNNMIUA CAXJ SDTM MBJ TMFK JU CO WWIATJ PPQOBQIA ERVG AAFQNL (key: "Hi, there. I'm Garfield. I'm a cat, and this is my cartoonist, Jon.")
panel 3: XDR QRUY XQQYPQC IB CO ERCHATFMW YSDO. NEJV EPU. (key: "Our only thought is to entertain you. Feed me.")
The decrypted text then consists of only the letters A-J:
panel 1: FJ AJEJA EJG IJG IJHIJFEJC EJE J ADJAGJAIDJH AAH DJGE JB CJ GIJ EJAJ JAAJ AIJA
panel 2: HJ AIDJBA AHI JCEJCJAIA AAEJ FAAF EJB BAHI JD JA IJAIAA BCJGIJEJ AJJA AJAIJA
panel 3: JJA CEJA EJCEJJJ AJ JA AEJDJAFEJ AEJJ. JAGJ ABA
These can then be converted to numbers (A=1, B=2, etc., with J=0) and reversed to become ASCII decimal values:
109 101 100 105 097 032 057 048 118 049 107 104 105 053 056 098 097 097 051 050 106
media 90v1khi58baa32j
109 101 100 105 097 032 119 109 104 098 122 056 116 051 119 103 053 098 112 049 108
media wmhbz8t3wg5bp1l
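As a sketch, the whole letters-to-digits-to-ASCII step can be done in a few lines of Python, mirroring the mechanics described above (not the solvers' actual tooling):

def letters_to_ascii(letters):
    # A=1 ... I=9, J=0; reverse the digit string and read it as
    # 3-digit decimal ASCII codes
    digits = "".join(str((ord(c) - ord("A") + 1) % 10)
                     for c in letters if c.isalpha())
    reversed_digits = digits[::-1]
    codes = [int(reversed_digits[i:i + 3])
             for i in range(0, len(reversed_digits), 3)]
    return "".join(chr(code) for code in codes)

panel_1 = ("FJ AJEJA EJG IJG IJHIJFEJC EJE J ADJAGJAIDJH AAH DJGE "
           "JB CJ GIJ EJAJ JAAJ AIJA")
print(letters_to_ascii(panel_1))  # -> media 90v1khi58baa32j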
The third panel turns out to be a partial solution that requires additional information.
The first two panels decode to mediafire strings. The first is http://mediafire.com/?90v1khi58baa32j
The file available is a ZIP archive named unlockering.zip containing a file unlockering.wav which is an audio file encoding data using the Kansas City Standard format. A python version of an encoder and decoder for this audio format was provided by Pangent in an early mediafire link available here: http://mediafire.com/?re6w37cv7aejo93
After decoding the audio into text, two sections are revealed. The first is more encoded text:
Zbm tmnvmo uhm vw cptikx Lnmsn. Aek ca n wptkfgf vridh xfm. Mpr lah kxzuzngth qk hibu iyta xtx qf flxva. Zbm ymtitb uhm jmtw jigws asst mxxm iah ftmq om knpltl Iawg. Gle qqd huty sf utrlz qf Enciykf. Prv bgmbj ca Faiha. Qny wai wxbe lovxc fjz lt npr fars btx iah gxvdkl zbyns bek ygrw ih Jlxca. N rejbbxyl osag eeu fqiis lqqn npr kigtp.
Which decodes as Vigenere with keyword ‘guineapix’ to:
The ginger one is called Daisy. She is a special needs pig. She has cataracts in both eyes and is blind. The little one with black nose ears and feet is called Lucy. The big ball of fluff is Annabel. Her breed is Swiss. The one with funky fur on the back end and ginger round the eyes is Boris. A neutered boar who lives with the girls.
Pangent Technologies tweeted the following image: https://twitter.com/pangenttech/status/845420515804119040 of four guinea pigs, so we now know the names of this fun-loving bunch.
Also, an edited version of this image had previously been revealed by Pangent.
The rest of the text from unlockering.wav is an ASCII-art image of a guinea pig.
The second panel's mediafire link is http://mediafire.com/?wmhbz8t3wg5bp1l, which contains another KCS-encoded audio file, this time named breathless.wav, with more Vigenere-encoded text (password 'guineapix'). The decoded text is as follows:
%%%%%%%%%%%%%%%%%% Lottie, you’d like this. I was reading a book the other day. We’re in the information age, right? All of us, we’re just bombarded by too much information, all the time. Spam emails and social media. Doing really simple things becomes difficult, or complicated, or distracting. We’re all looking for a way to cut through all the noise. It’s easy to blend in and get ignored. It’s an atmosphere which rewards people who really stand out. You have to be bold and spicy and delicious. You’ve got to be a spirit. Can’t be a ghost. Dick Tracy’s on, I’m gonna shut the TV off. There are days when I think we could take over Pangent. The three of us. Take it over and make it right. Maybe that’s just a dream. Maybe that’s the thought that keeps me here. Our mutual friend Alex Schreyer has created a culture of secrecy and fear. He learned that from his dad. So we’re scared. Hiding behind layers of security. We don’t talk to each other. That’s not the way to make a scientific breakthrough. We need to share information. We need to talk to each other, brainstorm with each other. Care about one another. We could fix Pangent. Change everything about it. We could accomplish great things. But we’re scared. Scared of what we could actually do. Lottie, there’s something you wrote for the personality test. Technology. Information wants to be free. And progress belongs to the people. Companies will keep their secrets. They’ll copyright anything they can. But they’ll market it and sell it and blow it up all over the place. Well, information gets out, one way or another. Once you invent something, you can’t uninvent it. Sometimes you have to be real careful with what you put out into the world. %%%%%%%%%%%%%%%%%%
We now come to the second Garfield post, from http://pangenttechnologies.tumblr.com/post/158944279592. This particular strip is from July 27, 1978, and while it may appear simple at first glance, a deeper look into the strip may reveal many significant connections and an expansive depth of meaning which just might capture a glimpse of the divine. See here for just such an investigation (thanks Lasagna Cat!)
Decoding this strip in the same manner as the first, with the Vigenere key being the original text, we obtain more letters which become digits. Panel 2's result can be combined with the partial solution from panel 3 of the first strip to become:
109 101 100 105 097 032 053 056 112 107 100 051 056 104 051 100 100 053 051 053 100
media 58pkd38h3dd535d
(Panel 3 is separate and decodes to 'GRONSFELD!!', which is another type of encoding, essentially a Vigenere cipher with a numeric keyword.)
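Since a Gronsfeld is just a Vigenere with a numeric key, a decoder is a small variation on the sketch above (the actual numeric keyword used in the ARG isn't given here):

def gronsfeld_decode(ciphertext, numeric_key):
    # each key digit shifts the matching letter back; non-letters pass through
    out, i = [], 0
    for ch in ciphertext.upper():
        if ch.isalpha():
            shift = int(numeric_key[i % len(numeric_key)])
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
            i += 1
        else:
            out.append(ch)
    return "".join(out)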
Leading to http://www.mediafire.com/?58pkd38h3dd535d which contains a ZIP archive named withourterrors.zip containing a KCS-encoded audio file withouterrors.wav. (It is unclear why the archive filename differs slightly from the wav filename, but the difference suggests two possible readings of the name: 'with our terrors' or 'without errors'.)
The Vigenere text here takes a new spin in that two known passwords are combined: the keyword 'argentinamayorsassenheim' is required to decode the text, which transcribes a conversation between Lottie and Leslie, reveals how strained their relationship has become, and ends in a rather shocking manner.
%%%%%%%%%%%%%%%%%% C: Drinking Bird update. Thirteen months without any significant incident. Lucky. I’ll raise a glass to that. The Cube has stabilized to a degree I wouldn’t have thought possible when we began studying it at the beginning of 2017. Reluctant as I am to admit it, Leslie Walsh (aka MayorSassenheim) is responsible for most of that. She’s written code that keeps the Cube stable without real peaks and valleys, and it seems the Cube is capable of learning from that. If I trusted her, I’d leave her to her work and be done with it. I wish I could do that. The situation would be simpler. Officially, I left Project 555 a long time ago. L: you pinged me why did you ping me C: I didn’t. I was talking about you. It must have pinged you automatically L: “Drinking Bird update” jesus lottie C: What? L: you shouldn’t even be doing this who is that even an update for C: For myself. L: of course C: talking to myself thinking about myself it’s what I’m good at. L: i didn’t say that. turn your mic on i’m switching to voice. C: Over a year without any kind of disaster. Leave me alone, I’m celebrating. L: You’re lurking. In the shadows of a project you were fired from how long ago? C: I can do what I want. L: Yeah, I think that’s actually your job title at this point right? Like we don’t know what to do with Lottie and we don’t want any trouble, so I guess Lottie does what she wants. C: Leslie, I’ve been avoiding this conversation for awhile, and I’ve thought long and hard about what to say. So, here it is. Fuck off and die, Leslie, and also fuck you. L: How’s the kid? C: Beautiful. She’s the sun and the moon and the sky and the stars. She’s about six full time jobs, but Eric helps. He’s not bad at the whole father thing. L: Good. I’m glad. I’d like to see her. C: I don’t think that’s a good idea. L: I wish you’d let me be a part of your life again. C: Well, shit happens. L: What was I supposed to do? C: The four of you stole the project out from under me. L: Is that what you think happened? C: And now he’s in charge of all of it. L: Officially, he always was. None of that changed. C: You took control of my work. L: And nobody died. Nobody got hurt. Like you said, nothing happened for thirteen months. What should I have done? Leave it all to him? Dammit Lottie, I did this for you. I worked myself to death on this project, just to make sure it wouldn’t kill anybody. C: It wasn’t your responsibility. It wasn’t your burden to take on. It’s mine. L: Well, shit happens. C: This project, it drains you, it changes you. You shouldn’t have had to deal with that. It’s wrong what you did, and it’s wrong what it did to you. To both of us. L: Lottie, I did my best. Under the circumstances I tried to do what you would do. In my head it’s like I was you. C: That’s ridiculous. For one thing you’re smarter than me. L: No, I’m not. My god, I’m not. C: You are. L: This is a weird thing to say, but like … do you ever feel like you’re in the wrong timeline? Like you’re in a science fiction book about, what if the Nazis won World War II or something? Sometimes it feels like, this wasn’t supposed to happen. Like everything is wrong. C: Yes. I do feel that way sometimes. I think it’s called depression. L: Why do you hate me? C: I don’t hate you. I don’t want to hate you. We were friends and I love you. L: Lottie. C: But we can’t even be friends now and it’s bullshit. It’s him. Alex. And Liam. On their own they’d be almost harmless. The two of them together, they amplify each other’s worst qualities. 
L: Oh believe me, I know. Multiplying and dividing by a pair of zeroes. C: God, them and every other asshole I had to deal with and still stay halfway sane. L: Project Four. C: Project fuckin Four! Working at Pangent, it’s poison. It poisoned you and it poisoned me. Maybe I do hate you, Leslie. I hate what you did. I hate that you did it better than I could have. L: I miss you. C: Don’t be dramatic. We were only friends to get back at him. L: We were friends because we were friends. C: You know what he said to me, a couple weeks before he took control of 555? He said, I know that you and Leslie will always be friends. Didn’t sound great, coming from him. I think somewhere inside me, I made my mind up to prove him wrong. L: We could have talked this out. Dammit, we were friends. We were supposed to protect each other. C: Whatever that was, that was a moment in time, and times change. You’re stuck in a moment and you think that moment will last forever. But really a moment only comes once. L: I saw you, poking around in the code. Like the whole entire time. You weren’t super subtle about it. Were you trying to get my attention? C: Probably. Yes. L: Alex saw it too, I’m sure he did. C: I don’t care. L: I mean like I kept him from seeing the last build you did. Well, the second to last build. You should really have hidden that better. C: I’m sure I don’t know what you’re talking about. L: He isn’t spying on this chatroom. C: You don’t know that. You fucked me over so many times. You let him in through every backdoor. L: I let him play. Like, I let the baby have his bottle. It was gonna be a fucked up situation either way. C: That doesn’t mean I can ever feel safe again. L: I get that. I’m sorry. C: You should be. I should be. There’s a lot of sorry to go around. L: You finished it, didn’t you? 555. That was supposed to be a final build. Then you erased it. C: I stopped it before it finished uploading. L: It was a finished build. C: Compiled without errors. L: I thought so. What would happen if we ran that code? C: I don’t know. It’s gone. Deleted. Forever. L: I have the partial upload. C: Of course you do. L: I saw you uploading. I mean I started downloading the minute the transfer started. And yeah, so it cuts off at a certain point. So I found the matching point in your previous build and put two and four together. C: See, this is why I can’t trust you. L: I mean I don’t know everything you changed in that build but like I figure it was all pretty top level stuff and bug fixes, whatever was left. My copy is spliced together, but do you think it’s identical to the build you deleted? C: I don’t know. L: It compiled without errors. C: Leslie, no. Oh, I can’t breathe. L: I’ll ask you again. What would happen if we ran that code? C: I don’t know. I don’t know if I want to know. It’s dangerous. L: Of course it is. I mean that’s been the whole problem from the start. C: Yes. L: You finished the code. All of it. And then you deleted it. C: Yes. L: Are you glad I kept the data? C: Yes. God, I’m actually shaking. I hope he’s not listening to this. L: So, what do we do from here? C: I don’t know. Leslie, I’m sorry for not talking to you like a human being for so long. I didn’t know what to say and after awhile, it was just … I put it off too long. L: Not too proud of any of this myself. For reals we’ve both got a lot to pologize for. C: There’s a lot of sorry to go around. L: Yeah but like, I’d like to see you. You, with Eric, and Rachel, and Sharon too. C: Who is Sharon? %%%%%%%%%%%%%%%%%%
!! shocking !!
The third Garfield comic post is here: http://pangenttechnologies.tumblr.com/post/158944845847 and depicts Odie’s first appearance in the strip on August 8, 1978.
For this comic, the text is all treated as one code such that the encrypted text:
NVCHAWJLLYOXNEYNTHIRNNSDRTJJSIWOZLYCIBPHHLREFXYVQLICTGZUAWSEZNAFSMHIYBQATBNLHOXQAAETIGXNEWDIUEJSENOTHXAFAMAHFRLJPAXQLAFSLYLDWSEHLEWSEHIVTHBVAPTBPWKAVECHKXNMAUITHG
decoded with key: isthatallyouhavetheonesuitcasenotexactlyhereboyohlawseylawseylawsey
decodes to:
FDJAADJAAAADGEDJAAEDAJAJJAHJAEJAGHBCGIEJAHAAEJAHJAIGBCBJAAAABCAJAIJAGIJAABCAJADJAFAABCJAAEJABCJAAAAADAADHBCABAHIBCJJAAJAHAA DAAAJAEAAAJADAABCAEIDBCDAAAJAGJAIIAAAFG
decodes to: 640114011114754011541010018015017823795018115018019723201111231019017901123101401611230115012301111141148231218923001101811 411101511101411231594234111017019911167
spaced according to the spacing in the edited comic, then reversed:
76 111 99 107 101 114 32 49 51 32 114 101 115 101 114 118 101 100 32 98 121 32 84 114 111 110 32 105 110 32 116 104 101 32 110 97 109 101 32 111 102 32 79 108 105 118 105 97 32 87 105 108 100 101 45 110 45 74 111 104 110 46
decimal ASCII decoded to:
Locker 13 reserved by Tron in the name of Olivia Wilde-n-John.
The fourth Garfield comic celebrates the Monday-hating cat’s love of Lasagna:
NPNUSTYOBAABMXJGKORDINERYFHUHTUMSTVANCIRMEAACYABOUXWATASGTVOSTPFTFNCBOOULUASBINJIPSUXTYPWAAVIAAHGPSDRNARHCDTFORISYUCWCEMVCYAZEIBUVVNBTBZFUVXSTYEYFEFTFOXDPASAPNDIMKWTXGRVTDVERJGKXRLQNARDI
with key: imjustyouraverageordinarycatforinstanceimcrazyaboutnaturesmostperfectfoodlasagna
decodes to:
fdeaaaaahjagigjagaaaaaeaadhbcfdefbcaaaejacjadaaaaaejaagbcbjaaaabcajaijagijaabcajadjafaabcjaaejabcbbajaaajadaaaaafgbcjaaejahaagiagbcabahibcjjaajahaadaaajaeaaajadaabcbeidbcdaaajagjaiiaaafg
decodes to:
645111118017970171111151148236456231115013014111115011723201111231019017901123101401611230115012322101110141111167230115018117917231218923001101811411101511101411232594234111017019911167
reversed and spaced as in the edited post:
76 111 99 107 101 114 32 49 52 32 114 101 115 101 114 118 101 100 32 98 121 32 71 97 118 105 110 32 76 111 114 101 110 122 32 105 110 32 116 104 101 32 110 97 109 101 32 111 102 32 71 105 111 114 103 105 111 32 65 46 32 84 115 111 117 107 97 108 111 115 46
Locker 14 reserved by Gavin Lorenz in the name of Giorgio A. Tsoukalos.
The fifth comic posted is the first part of a late-October 1989 series of strips placing Garfield in a distinctly terrifying situation.
http://pangenttechnologies.tumblr.com/post/158946474637
GU RAT HFR EBA LQI SUI NHE RK UJ RSM OAN LNG CP HCA TNE VRE ZK OU BCI ONT IK SMO MBN ZN NEL MK KNM BQO RET PD NCO RCI OW NH RR TW FRE AWY JXD GN PO EJM HT PP NHO UQA CEN RID EJH SWA LXN HYO VC TI OD SH LEL DKR XAT PM RES FI
with key: BRRR THERE’S A CHILL IN HERE THIS MORNING WHAT AN EERIE SENSATION THIS DOESN’T FEEL LIKE MY HOME TO BE CONTINUED JON ODIE ANYBODY HOME I’M ALONE YOU HAVE NO IDEA HOW ALONE YOU ARE GARFIELD
decodes to:
FD AJA ABA AJA JJA HJA AAA AG BC JAA AJA DAA GI HJA GJA EJA HG BC BJA AAA BC AJA IJA GI JAA BC AJA DJA FAA BC JAA EJA BC JE ID GI CJA AJA IJA IG BC ABA HI BC JJA AJA HAA DAA AJA EAA AJA DAA BC CE ID BC DAA AJA GJA II AAA FG
decoded and reversed:
76 111 99 107 101 114 32 49 53 32 114 101 115 101 114 118 101 100 32 98 121 32 79 109 101 103 97 49 50 32 105 110 32 116 104 101 32 110 97 109 101 32 111 102 32 78 105 107 108 97 114 101 110 32 71 111 108 100 101 121 101 46
ASCII decimal for:
Locker 15 reserved by Omega12 in the name of Niklaren Goldeye.
Bringing us up to 15 of the 54 lockers reserved, and leaving us wondering about Sharon's fate as we work to 'unlock the pigs', a long-standing goal that appears to be growing in significance, importance, and possibly risk.
Preparing Data for Machine Learning with BigML
At BigML we’re well aware that data preparation and feature engineering are key steps for the success of any Machine Learning project. A myriad of splendid tools can be used for the data massaging needed before modeling. However, in order to simplify the iterative process that leads from the available original data to a ML-ready dataset, our platform has recently added more data transformation capabilities. By using SQL statements, you can now aggregate, remove duplicates, join and merge your existing fields to create new features. Combining these new abilities with Flatline, the existing transformation language, and the platform’s out-of-the-box automation and scalability will help greatly to solve any real Machine Learning problem.
The data: San Francisco Restaurants
Some time ago we wrote a post describing the kind of transformations needed to go from a bunch of CSV files, containing information about the inspections of some restaurants and food businesses in San Francisco, to an ML-ready dataset. The data was published by the San Francisco Department of Public Health and was structured in four different files:
businesses.csv: a list of restaurants or businesses in the city.
inspections.csv: inspections of some of the previous businesses.
violations.csv: law violations detected in some of the previous inspections.
ScoreLegend.csv: a legend to describe score ranges.
The post described how to build a dataset that could be used to do Machine Learning with them using MySQL. Let's now compare how you could do that using BigML's newly added transformations.
Uploading the data
As explained in the post, the first thing that you need to do to use this data in MySQL is define the structure of the tables where you will upload it, so you need to examine the contents of each column and assign the correct type after a detailed inspection of each CSV file. This means writing commands like this one for every CSV:
create table business_imp (
    business_id int,
    name varchar(1000),
    address varchar(1000),
    city varchar(1000),
    state varchar(100),
    postal_code varchar(100),
    latitude varchar(100),
    longitude varchar(100),
    phone_number varchar(100)
);
and some more to upload the data to the tables:
load data local infile '~/SF_Restaurants/businesses.csv'
into table business_imp
fields terminated by ',' enclosed by '"'
lines terminated by '\n'
ignore 1 lines
(business_id, name, address, city, state, postal_code,
 latitude, longitude, phone_number);
and creating indexes to be able to do queries efficiently:
create index inx_inspections_businessid on inspections_imp (business_id);
The equivalent in BigML would be just dragging and dropping the CSVs into your Dashboard.
And as a result, BigML infers for you the types associated with every column detected in each file. In addition, the types being assigned are totally focused on the way the information will be treated by the Machine Learning algorithms. Thus, in the inspections table we see that the Score will be treated as a number, the type as a category, and the date is automatically separated into its year, month and day components, which are the ones meaningful in ML processes.
We just need to verify the inferred types in case we want some data to be interpreted differently. For instance, the violations file contains a text description that includes information about the date the violation was corrected.
$ head -3 violations.csv
"business_id","date","description"
10,"20121114","Unclean or degraded floors walls or ceilings [ date violation corrected: ]"
10,"20120403","Unclean or degraded floors walls or ceilings [ date violation corrected: 9/20/2012 ]"
Depending on how you want to analyze this information, you can decide to leave it as it is, in which case the contents will be parsed to produce a bag-of-words analysis, or set the text analysis properties differently and work with the full contents of the field.
As you see, so far BigML has taken care of most of the work: defining the fields in every file, their names, the types of information they contain, and parsing datetimes and text. The only remaining contribution we can think of now is taking care of the description field, which in this case combines information about two meaningful features: the real description and the date when the violation was corrected.
Now that the data dictionary has been checked, we can just create one dataset per source by using the 1-click dataset action.
Transforming the description data
The same transformations described in the above-mentioned post can now be applied from the BigML Dashboard. The first one is removing the [date violation corrected: …] substring from the violation's description field. In fact, we can go further and use that string to create a new feature: the days it took for the violation to be corrected.
This kind of transformation was already available in BigML thanks to Flatline, our domain-specific transformation language.
Using a regular expression, we can create a clean_description field by removing the date violation part:
(replace (f "description") "\\[ date violation corrected:.*?\\]" "")
Previewing the results of any transformation we define is easier than ever thanks to our improved Flatline Editor.
By doing so, we discover that the new clean_description field is assigned a categorical type because its contents are not free text but a limited range of categories.
The second field is computed using the datetime capabilities of Flatline. The expression to compute the days it took to correct the violation is:
(/ (- (epoch (replace (f "description")
                      ".*\\[ date violation corrected: (.*?) \\]" "$1")
             "MM/dd/YYYY")
      (epoch (f "date") "YYYYMMdd"))
   (* 1000 24 60 60))
where we parse the date in the original description field and subtract the date the violation was registered on. The difference is stored in the new days_to_correction feature, to be used in the learning process.
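For readers less familiar with Flatline, here's a rough Python equivalent of the same computation (the function and argument names are ours):

import re
from datetime import datetime

def days_to_correction(description, inspection_date):
    # pull the corrected date out of the description, then subtract
    # the date the violation was registered ("YYYYMMdd" format)
    match = re.search(r"\[ date violation corrected: (.*?) \]", description)
    if match is None:
        return None
    corrected = datetime.strptime(match.group(1), "%m/%d/%Y")
    registered = datetime.strptime(inspection_date, "%Y%m%d")
    return (corrected - registered).days

print(days_to_correction(
    "Unclean or degraded floors walls or ceilings "
    "[ date violation corrected: 9/20/2012 ]", "20120403"))  # -> 170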
Getting the ML-ready format
We've been working on a particular field of the violations table so far, but if we are to use that table to solve any Machine Learning problem that predicts some property of these businesses, we need to join all the available information into a single dataset. That's where BigML's new capabilities come in handy, as we now offer joins, aggregations, merging and duplicate removal operations.
In this case, we need to join the businesses table with the rest, and we realize that inspections and violations use the business_id field as the primary key, so a regular join is possible. The join will keep all businesses, and every business can have zero or multiple related rows in the other two tables. Let's join businesses and inspections.
Now, to have a real ML-ready dataset, we still need to meet a requirement. Our dataset needs to have a single row for every item we want to analyze. In this case, it means that we need to have a single row per business. However, joining the tables has created multiple rows, one per inspection. We’ll need to apply some aggregation: counting inspections, averaging scores, etc.
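As a hedged sketch, that aggregation could be expressed through the same Python bindings used later in this post (the dataset variable, its placeholder id, and the output field names inspections/avg_score are our assumptions):

from bigml.api import BigML

api = BigML()
business_inspections = "dataset/..."  # id of the joined dataset (placeholder)

# one row per business: count the inspections and average the Score
inspections_agg = api.create_dataset(
    [business_inspections],
    {"origin_dataset_names": {business_inspections: "A"},
     "sql_query": "select business_id, count(*) as inspections, "
                  "avg(Score) as avg_score from A group by business_id"})
api.ok(inspections_agg)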
The same should be done for the violations table, where again each business can have multiple violations. For instance, we can aggregate the days that it took to correct the violations and the types of violation per business.
And now, use a right join to add this information to every business record.
Finally, the ScoreLegend table just provides a list of categories that can be used to discretize the scores into sensible ranges. We can easily add that to the existing table with a simple select * from A, B expression, plus a filter to select the rows whose Score field value is between the Minimum_Score and Maximum_Score of each legend. In this case, we'll use the more full-fledged API capabilities through the Python bindings:
# applying the sql query to the business + inspections + violations
# dataset and the ScoreLegend
from bigml.api import BigML

api = BigML()
legend_dataset = api.create_dataset(
    [business_ml_ready, score_legend],
    {"origin_dataset_names": {
        business_ml_ready: "A",
        score_legend: "B"},
     "sql_query": "select * from A, B"})
api.ok(legend_dataset)

# filtering the rows where the score matches the corresponding legend
ml_dataset = api.create_dataset(
    legend_dataset,
    {"lisp_filter": "(<= (f \"Minimum_Score\")"
                    " (f \"avg_score\")"
                    " (f \"Maximum_Score\"))"})
With these transformations, the final dataset is eventually Machine Learning ready and can be used to cluster restaurants into similar groups, find the anomalous restaurants, or classify them according to their average score ranges. We can still generate new features, like the distance to the city center or the rate of violations per inspection, which can help to better describe the patterns in the data. Here's the Flatline expression needed to compute the distance of the restaurants to the center of San Francisco using the Haversine formula:
(let (pi 3.141592
      lon_sf (/ (* -122.431297 pi) 180)
      lat_sf (/ (* 37.773972 pi) 180)
      lon (/ (* (f "longitude") pi) 180)
      lat (/ (* (f "latitude") pi) 180)
      dlon (- lon lon_sf)
      dlat (- lat lat_sf)
      a (+ (pow (sin (/ dlat 2.0)) 2)
           (* (cos lat_sf) (cos lat) (pow (sin (/ dlon 2.0)) 2)))
      c (* 2 (/ (sqrt a) (sqrt (- 1 a)))))
  (* 6373 c))
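For reference, here's a plain-Python sketch of the same distance feature (note that the canonical haversine uses atan2 for the arc, where the Flatline expression above uses a sqrt ratio, a close approximation for small distances):

import math

def km_to_sf_center(longitude, latitude):
    # haversine distance (km) to the same SF center point used above
    lon_sf, lat_sf = math.radians(-122.431297), math.radians(37.773972)
    lon, lat = math.radians(longitude), math.radians(latitude)
    dlon, dlat = lon - lon_sf, lat - lat_sf
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat_sf) * math.cos(lat) * math.sin(dlon / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return 6373 * c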
For instance, modeling rating in terms of the name, address, postal code or distance to the city center could give us information about where to look for the best restaurants.
Trying a logistic regression, we learn that to find a good restaurant, it’s best to move a bit away from the center of San Francisco.
Having data transformations in the platform has many advantages. Feature engineering becomes an integrated feature, trivial to use, and scalability, automation and reproducibility are guaranteed, as for any other resource (and they are one click away thanks to Scriptify). So don't be shy and give it a try!
Off Grid Sprint Update 16.8.2017 - Imma chargin ma lazer
Heyho! It's been a long sprint, this one, due to various gallivanting and laser-based activities, so apologies for the adjustment to our usual blogging schedule. The upside is that we have packed in an incredible amount to tell you about!!
Develop
This sprint started with us heading down to the Develop conference in Brighton, as Harry was giving a talk on 'Making a hacking game hackable'. The talk went incredibly well and generated lots of in-depth questions from the audience.
We had the opportunity to meet up with all kinds of incredible folk and show them the game, including the ever lovely Dan Marshall who was extremely excited by the game and had these lovely things to say about us :D
That one is definitely going in olive wreathed award quotes!
Harry also found time to appear on Keir Miron (of Darkest Dungeon fame)’s podcast 'The Question Bus’ and was interviewed in a corridor at the conference! You can listen to it here.
SHA2017
We have been pushing a bunch of features forward in the game, especially within the modding toolset, getting it ready for SHA2017, a hacker camp in the Netherlands that we were invited to. Rich and Harry spoke and ran workshops focusing on how hackers and modders can use our modding tools to create interesting hacks that reflect real-life vulnerabilities.
The talk and workshops went really well - we learned a heck of a lot and we got a fair few people in using the tools. You can watch the talk here and we'll be putting up a full blog post on the whole experience in the coming weeks.
Steelcon
In the run up to this, Rich headed up to Steelcon again to gather inspiration (the SHA talk references how one of the first hacking mods we made in Off Grid was based on a Steelcon talk about hacking the Nissan Leaf electric car by Scott Helme last year). There was loads of interesting brain juice this year, including our friend Darren Martyn’s hilarious talk on Hacking ACS Servers for World Domination and this unnerving talk by Ken Munro from Pen Test Partners about the awful vulnerabilities they had found in IoT sex toys...
Mod hacking workshop with Spoonzy
On top of having gotten a good start on testing the modding tools with Dominic during his work experience last month, we took this a step further and got our mate Spoonzy to come in and apply some hacking knowledge to crafting some hackable devices with the modding tools. Spoonzy spent the day with us and came up with some pretty epic ideas for hacks - we made a start on a couple of them and he prompted a few fixes that we got to work on ahead of our workshops at SHA.
New Networking
This included Harry working on switching the networked tools that allow modders to build their own level changes directly into the game over to our own networking code (much faster iteration time!). This took a little wrangling, but as well as speeding up the workflow, it has put in a couple of checks that prevent potential loss of work. The old way used Unity's built-in TCP/IP networking code and made use of the editor's play button in the modding project. If you have ever done any Unity development, you will know that if you leave that on by mistake when working on objects or values in the scene, you will lose your work when you stop play mode in the editor; the new networking code avoids this pitfall entirely.
Tools!
We wrote an export tool that can now generate clean versions of LevelKit for us to submit to our Steam tools section for the game, meaning that any half-baked dev content is stripped out and our modders have a nice clean project they can work within.
Spawning
We also found that folk wanted to know what direction characters would face when spawned, so we now allow users to visualise player and guard spawn rotation with an arrow marker on the mission object.
Templates
We wanted modders to be able to jump straight into mission and hackable-object scripting if they want to, rather than having to do any geometry or environment work, so we made a template scene: modders can jump into a working mission with hackable devices already in it, ready to be edited, scripted and added to. It is just a basic couple of rooms with a few props and a basic laptop to hack into as an example, but it is a good start. The laptop in it even had the SHA wiki as its front page :P
All your data belong to us
All networked devices now have a data inventory. This is the first step in making characters dynamically populate the devices in the world with their personal data. This way, when you hack a device you are able to get specific information on the NPCs around you who have interacted with that device, building up a picture of their behaviours and routes, and of what data or other devices you might be able to manipulate them with.
No Rest for the Wicked
And one for the books... we fixed guards giving up patrolling after roughly 5 patrol points. We found that they had not been given a place to rest and recoup their motivation, so they were essentially lazy guards striking due to lack of coffee!
Progress and Affect
Modders can now update the game progress via Lua, which means conversations, objectives, or other triggers and interactions in your level can affect another mission. For example, finding a hidden space in a level or making specific conversational choices in a set of CryptoChat messages can be used to unlock new levels and side missions within the wider game. Essentially, branching narratives and their effects can be set off by any Lua triggers in a level.
We also added an error-checking pass on the conversation system (making it more robust). It'll now be easier to avoid Lua compile errors and get your conversation scripts into the game and conversing properly.
Wiki
The other thing we have done is some fairly comprehensive work on getting the Off Grid Wiki to a stage where it is useful to players and modders. We now have a bunch of articles which describe how you can go about making mods for the game. We'll update you again soon, once the structure of those articles has come together!
Character customization
We are aiming to make all characters more customizable for modders, and want to set up the colors in a way that requires as few special skills or programs as possible to change them for an existing character, while still using a single mesh and a single material on each character in game.
Normal textures would do the job, but editing the colors on them afterwards isn't really for everyone. Vertex colors are easier, but editing has to be done with a 3D modelling tool, and creating a different color version of a character requires creating a different model (or messing with the vertex color data on the fly through code). Our solution was to use a simple texture as a color look-up table, so once the characters are set up correctly (by us), all one needs to do to change them is open the texture file in any image editor (even MS Paint would work great for this) and paint the tiles with new colors.
Since we don't need any detailed textures or anything, we can just assign the UVs of the models to those colored tiles, and in theory we only need one pixel per color, so the textures can be really small. Although, to make things easier to work with and to avoid any issues with color bleeding when the images are compressed, we settled on a 64x64 texture (the resolution doesn't actually matter) with an 8x8 grid of colored tiles. That's 64 colors per model (with 8x8 pixels each), which should be more than enough for our art style...
In addition to the color table, we also added support for an additional texture that works the same way, but is used to set the glossiness of the material, and to switch between metallic & non-metallic material.
...and both these textures are of course fully optional; all the characters have default textures built into the game, so if modders leave one or both textures out, we'll just use the defaults instead.
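As a rough illustration of the look-up idea (our own sketch, not the game's actual code): with an 8x8 grid on the texture, a model part just needs UVs pointing at the center of its color tile.

def tile_center_uv(row, col, grid_size=8):
    # UV coordinates (0..1) of the center of a tile in the 8x8
    # color look-up grid; every vertex of a part shares this UV
    u = (col + 0.5) / grid_size
    v = (row + 0.5) / grid_size
    return u, v

print(tile_center_uv(0, 0))  # -> (0.0625, 0.0625)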
2nd Floor of the Apostle level
If anyone reading this has had a chance to play Off Grid and made it through our newspaper office level, you might remember the 2nd floor of the building being pretty barren, to put it nicely. Properly decorating that part of the level and adding some actual gameplay has always been our plan, but there have just always been more important things to do.
Well, this sprint we finally decided it was time to add some content there. No spoilers, but let's just say that you can't just run through that floor any more, and trying to sneak past the guards would be pretty difficult as well. Instead you need to figure out a way to get the guards out of your way. There are a few possible solutions, like you'd expect in a stealth game, and we think there's room for some more as well.
Other changes & fixes
Added a bunch of new sound events for different parts of the UI
Text-only lines in the ncurses-like remote connection window are now correctly displayed using the foreground text color (as defined in the device's Lua script)
File viewer UI can now be closed with the "back" button (B button on gamepad, backspace on keyboard), and also closes automatically when the pause menu is closed.
Biz Rumblings
There has been a bunch of production and business development going on in parallel to all this, and we have spoken to loads of interesting folk about opportunities for the game, but as always, you'll have to wait to hear more about that!
New *MOAR* blog posts
We have decided to shift the format here on the devblog slightly: we are planning on continuing with these monthly sprint updates, but maybe making them a little more concise, while also expanding on one or two of the items from each update in blog posts in between, so watch this space! We are going to start with Harry writing up a follow-on from his Develop talk, the kinds of things he touched on, and his impressions of the conference as a whole, so look out for that!
Speak to you even sooner!
Rich, Pontus, Harry, and Sarah.
#developconf #Dan Marshall #sha2017 #loading bar #Pen Test Partners #Darren Martyn #Scott Helme #modding #custom characters #hacking #hacking game #data privacy #mass surveillance #video games #gamedev #game design #indiedev #indiegames #offgridthegame #infosec #modded games #modders #moddable #Lua