#if they could give me the raw data i could even program something to crawl that offline
Text
jellyneo doesn't allow scraping but says you can request an exception "in extremely rare cases," i wonder if the NBP would qualify
#nbp meta#i don't think i'd be making too many requests#like a few times per item to get the piccy and category and such#it's a countable finite number and wouldn't repeat requests#if they could give me the raw data i could even program something to crawl that offline#90% of this project is opening jellyneo and scrolling for info#i'd reach out immediately if i weren't worried that actually *getting* that data would give me a whole new subproject
1 note
Text
Ratchet and Kim Possible: The Solanian Revolution-Part 20
They arrived at Qwark's Hideout. After they landed, they gave themselves a little bit of time to look out at the entire place. Kim: "Wow…this is…such a secluded place." Ron: "Yeah, this would seem like a good place to set up a secret hideout but did it have to be someplace so cold?" Kim: "That would make it even more difficult to seek him out." Ron: "Good point." Ratchet: "Come on, guys. Qwark has to be here somewhere."
They rushed out. As they ran through the area, they came across a few nasty, robotic hazards, such as some little pests that they did not think they would ever have to face. Computer voice: "Lawn Ninja Defense System Activated!" Ron: "Oh, no, not the Lawn Ninjas!" Kim: "Heh, so not the drama."
They fought through the little robotic pests and wrecked every single one of them without even a scratch, except for Ron. One of the Lawn Ninjas sliced his pants up almost completely. Ron: "Aw, man! Come on! Seriously?" Ratchet: "I was wondering when that would happen again." Clank: "I thought you guys managed to fix that problem." Kim: "So did I."
Thankfully, the Pants Regenerator was still functioning, so it repaired Ron's pants in a flash. Ratchet: "Well…at least Wade's Pants Regenerator is still working…come on, let's keep going before something else rips up his pants."
They continued on. Ron ran a little slower than the others. Ron: "Right behind you guys."
As they persisted through the area, they found themselves faced by a lot of Qwark bots. Not only did they look like him, but they talked like him, too. Kim: "You've got to be kidding me! Qwark bots?" Ratchet: "Hey, at least we know this is where Qwark set up his base if we didn't know that already." Ron: "Man, it's like a Qwark-infested nightmare!" Kim: "Heh, couldn't have said it better myself, Ron."
They kept going, fighting through many of the Qwark bots as well as several more Lawn Ninjas. They soon hit a roadblock that they had to rely on Clank and the monkey to help them overcome. After the extended effort, they were able to overcome the impasse and find Qwark. When they managed to reach him, he was wearing crimson long johns that also went over his head like his hero outfit. He was also wearing his gloves and shoes. He noticed his monkey almost immediately. Qwark: "Hey, little buddy! How did you find me?"
He soon noticed the group a little afterwards. Qwark: "Huh? Oh, it's you guys. What are you doing here?" Kim: "We were looking for you." Ratchet: "We found your secret Holo-vid, Qwark. That's how we were able to find this place." Qwark: "Oh…that…well, I didn't expect you guys to understand." Kim: "Understand what exactly?" Qwark: "You probably think, I would give anything to have a body like that, for one drop of his raw animal magnetism, an iota of…" Ron: "Uh, is he serious right now?" Kim: "I can't tell whether to say yes or no." Ratchet: "OK, enough! We don't want to hear any more of your mindless babbling!" Qwark: "Er…right, let's see, where was I…"
As he continued on and on, Ratchet, Kim and Ron exchanged displeased looks with each other. Qwark: "Ah, yes. But despite my outward appearance of utter perfection, well…um…" Ratchet: "Deep down, you're a cowardly wuss?" Kim: "That pretty much hits the nail on the head." Ron: "Agreed." Qwark: "Uh…no…not…exactly…when I escaped from that star cruiser, cheating death by mere nanoseconds, I suddenly realized something very important: I could have died! Me! Captain Qwark! Imagine an entire galaxy with no more…me." Kim: "Well, one galaxy fell into chaos because of you, so…an entire galaxy without you would have one less strife-inducing nut case." Ratchet: "Good one, Kim." Qwark: "And all of it for what cause? So a few trillion people get turned into robots? Who am I to say who should or should not be turned into a robot? Pffft! Whatever!"
The entire group was appalled by Qwark's apathetic attitude. Ratchet: "You know what? You're pathetic, Qwark! I can't believe I once looked up to you. Come on, guys, let's get out of here."
Ratchet walked out. Ron: "He used to look up to you? That's even more shocking than someone who looks up to me!"
Ron walked out after Ratchet. Qwark: "Uh…I…was it…something I said?" Kim: "Gee, I don't know, everything you've said and done so far clearly demonstrates that you're a bona fide hero." Qwark: "Really?" Kim: "No, of course not. I was being sarcastic, Captain Bozo. Maybe if you started actually being a real hero instead of just acting like one, you would actually start getting some respect for a change."
Kim walked out. Clank was about to walk out, too, but then stopped and looked up at Qwark. Clank: "She is right. The people of this galaxy need you, Qwark. They believe in you. You can give them hope. You have a chance to redeem yourself and become the hero you have always wanted to be."
Clank walked out. The monkey screeched at him once before walking out with the group. Despondent, Qwark looked towards his hero suit.
The group arrived back at their ship. As soon as they got back in, they immediately received a transmission from Sasha. Sasha: "Ratchet! Kim! Where have you been?" Kim: "Sasha!" Ratchet: "Uh…we were just…" Sasha: "Nevermind! The Phoenix is under attack! The shields are…"
The transmission was interrupted for a brief moment. Ratchet: "Sasha!"
The transmission reconnected. Sasha: "To 40%. Life supports are…I think they're getting…aboard the ship. Whatever's on that disk has Nefarious worried…do anything to get it back…Hey, hotshot, if I don't see you again, I just want to tell you…"
The transmission was cut off and did not reconnect again. Kim: "Sorry, Ratchet…I'm afraid…the signal was lost…" Ratchet: "Kim, engage the gravimetric warp drive! Let's get back to the Phoenix as fast as possible!" Kim: "Sure thing, Ratchet!"
They took off and left for the Phoenix.
After returning, they saw that the entire place was under red alert. They also found some of Nefarious' bots throughout the place. Kim: "Whoa…Nefarious has really hit this place hard." Ratchet: "Come on, we have to make sure Sasha and the other crew members are alright."
They rushed out. Ron, however, became easily side-tracked. Ron: "Oh, no! The VG-9000 and virtual menu! I have to make sure they're both OK!"
Ron ran towards their quarters but Kim grabbed him by his shirt collar and pulled him the other way.
They rushed towards the bridge but found that the access transporter was broken. Ron: "Oh, no! They broke the transporter to the bridge!" Kim: "We'll have to find another way to reach the bridge!" Ratchet: "Then, let's get going! We can't stand here and do nothing!"
They hurried along through the newly opened area where they found a crawl space to enter. There, they fought off more robots as they proceeded through. This was where Kim and Ratchet had to really work together. They fought their way through some vicious robots until, eventually, they made it to the bridge.
As soon as they entered, they found Sasha, Helga and Al hiding below the walkway to the captain's seat. Helga: "Well, you certainly took your sweet time." Ratchet: "Hey, it's good to see you, too."
The group walked towards them. Kim: "Is everyone OK?" Sasha: "We're fine, Kim. You guys made it just in time." Ratchet: "Any luck with the data disk?" Al: "Hmph! "Luck" he says." Ron: "So…that's a no?" Al: "No!? Do you think I would be incapable of solving this thing? Hilarious!" Sasha: "Al cracked the encryption. The disc contains a complete copy of Nefarious' battle plan." Kim: "Whoa…Nefarious had every right to be worried." Ron: "Oh, boy, you can say that again." Sasha: "He's going to attack planet after planet, leaving nothing but robots in his wake." Kim: "Well, that seemed pretty obvious." Sasha: "There's more to it. The Biobliterator is so well protected, Nefarious doesn't believe there is any chance we can stop it." Kim: "Oh, I'm sure we'll find a way." Ratchet: "Can we really?" Al: "I estimate our odds at approximately 1 in 63 million, give or take." Kim: "Is that all? I've handled worse." Ratchet: "I don't know, Kim. This may be too much, even for you." Kim: "Even if that is true, there has to be a way to stop it. I won't give up so easily, even if the odds are far beyond anything we can handle." Al: "Um, perhaps there is something. The Biobliterator is programmed to recharge its power cells after each attack." Kim: "Well, there you go. We've got ourselves a potential weakness." Ron: "So, does it have a specific place where it recharges its batteries, or does it have several scattered charge stations?" Sasha: "It's currently recharging at a base on Planet Koros." Kim: "Then that's where we need to go."
They were about to walk out. Ratchet: "By the way, did a little girl and her blue dog stop by here recently?" Sasha: "Um, yeah, they did actually. They gave us a few devices they said would make us immune to the Biobliterator. She said she and her friends are delivering them to everyone in the galaxy, but I don't think they'll be able to deliver them to everyone in time and for some, it's already too late." Ratchet: "Then we have no time to lose. We need to leave now." Sasha: "You guys should hurry because…the next target…is Veldin."
Ratchet, shocked to hear this, growled in anger. Kim: "We'll see to it that attack won't happen." Ratchet: "You said it, Kim."
They rushed out of the bridge and returned to the ship as fast as they possibly could. They then left the Phoenix.
#Kim Possible#Ratchet and Kim Possible Chronicles#Ratchet and Clank#Up Your Arsenal#Starship Phoenix#Ratchet#Clank#Ron Stoppable#Skrunch the monkey#Captain Qwark#Captain Sasha#Helga#Solana Galaxy
1 note
Text
The Start of Something Awful
@werewolfpine said I should post my writing and I’m doing it because I will literally never post unless someone forces me to, here’s a snippet of the lore of How Doc Ock Comes To Be, featuring my/Ock’s actual mind thought mannerisms. Technically this has only my S/I Oliver and Doctor Octavius a little at the front end, because I do my best work when it’s one character who thinks too much.
Word Count: ~1.4k Warnings: Self-harm (minor), queasiness (minor), astonishingly sarcastic narrator voice
“Hoshino.”
Oliver looks up from the box; as much as people focused on biorobotics, he rather preferred the metal things he'd been working on. None of that confusion of the 'bio' aspect. Cold techne and cold metal, a perfect complement to his frozen heart. Looked up at his teacher- professor, Otto Octavius, and said nothing.
“The test results..?”
Of course! How could he possibly forget the mind-crippling endeavor of writing up a lab report for the sake of his dear professor? It would never pass off as science if he didn’t suffer the hideous toil of turning his experiment into a report; let it be known that the gods themselves would forbid anyone to simply look at the raw data and draw their own conclusions- no, he has to bring their attention to that all himself.
Ability to self-replicate- [Y] Hive mind program- [Y] Formation of simple and complex shapes- [Y] Link to human minds- concept phase. Mobile complex shapes- concept only. Modify macro chemicals within human body- tested in organic slurry, dubious results. Anything else that could be interesting- hasn’t been conceptualized yet.
“Would you call it a success?”
“If it teaches something new, it is a success.”
“Then have you been taught anything?”
Oh, doctor, do not pretend! This is all just a reinvention of the wheel at this point. Smaller and still programmable they may be, but these are all things that have been done before. They were done decades ago, before everyone found biological machinery to hold more promise. What then is there to learn? Humans disagree with metal, that has been the lesson. Oliver answers in so dry a tone; “discussion section: page three. Sir.”
“So I read.” Oliver returns his attention to his robots, still attentive to the good doctor’s words; “you sound irritated- both in the paper and at present.”
It is proper to smile and shake his head, to set the doctor’s concerns to rest. He fails this task, and in the same dry tone; “I’m not. I have concerns that this research is dated at best.”
“Then you are-.”
Interrupting, and how uncharacteristic that was- “I don’t have the time to be emotional, in any event.” The professor seemed off-put by that. Indeed, it was rude of Oliver to interrupt; he makes note of that, and fails to realize that describing himself as necessarily emotionless might instead be the reason for the doctor’s discomfort. Even the good Doctor Octavius had room to be emotional when good or ill fortune struck.
There was a pause, a little too long, before the doctor spoke- he’d turned back only to give a half-question; “I trust that you can be left alone in the lab, Oliver?”
“Yes, Dr. Octavius.” Really, this was such a dumb question. Could Oliver be trusted? Of course not; every faculty member would agree, if they only knew the contents of his mind. Which made it a rather good thing, how very skilled in keeping his thoughts under lock and key he was. Not with his friends of course. With friends you were expected to share a certain amount of information, and in turn they shared meaningless data points that helped one curry good favor if one kept it all in mind. What a fun game that was, sifting through all that data and hoping you came across anything of interest.
Ah. And he was alone. The professor had left without him noticing.
“And if I am consumed by the plague I now set loose upon the earth, thus was my fate since the moment I was born; not God nor Man could stop me or my creations; Pandora, I call upon thee.” He was alone, could he not be dramatic? The box was opened, and the robots did... Absolutely nothing.
Oh good, they hadn’t developed sentience while he acted out his drama.
A scalpel he’d pilfered from his sibling on a recent trip home; it was perhaps not the cleanest, but it would serve to sever, given he’d sharpened it against bricks and stones when he’d had a moment to do so. The only issue now was to shut down his self-preservation instincts, which barely allowed a scratch to be made against himself. But not seeing the place he would cut made easier the act, and he cut into the skin that made up the hair line just behind and below his right ear.
The incision was easier than he’d expected, perhaps because it was so much closer to his dreams’ completion than anything else had been before. He pretends to be surprised by the blood, but to what end? No one is around.
He starts his computer up, watches the robots come to life, and opens up the file “Concept_Phase.chk”. Checkpoint reached, your game will now auto-save, he hums; for the first time he feels the striking chill of fear. He thinks perhaps it is the first time in his life, but knows instinctively this cannot be the case. Either way, one error at this point would be so much more devastating.
They were crawling into that bloodied cut now. He should have worn a different shirt, but at least the black on this one might spare the rest from carrying a stain. They were a horrible itching sensation in his skin- he forces his hand stationary, to meddle now is more threatening. It is most threatening; he does not understand the limits of the human body, but he does understand the delicacy of the brain.
And they are in his brain.
That is the most terrifying part of it all, and he suppresses the urge to vomit. Brains are such delicate things and he has put so many bits of metal into his. He suppresses the urge to stand and run from this horrible thing that he has done. He stays stock still, and feels fear in every muscle and every nerve ending of his body.
And they are in his brain.
He woke up, cold, and pushed himself off the floor. Linoleum or plastic tile- didn’t matter, it was cold. He almost felt annoyance- hadn’t he been doing something? It was awfully uncharacteristic of him to sleep in the lab. The computer lab, maybe, but this wasn’t that.
Oh fuck the robots and the cut- he grasps at his neck, drawing his hand away with the full expectation to pull away half-scabbed gunk, or blood still running. Nothing. He sighed. Maybe it was another dream- maybe he was still dreaming. Dreaming of being something worthy of pride and love, instead of the falsehood he’d built himself into. Of being a worthwhile investment on the part of his parents and friends. Of being something better than this, whatever this was.
Log onto his computer- and how very strange! He’d never run the checkpoint file before, if it was a dream, so why was there a .log version now? It was suddenly beginning to feel very much not like a dream. Uneasiness, like so many maggots in his stomach, seemed to eat at him. He reached up and closed the box that had once been the house of his pride, and scanned over the .log file.
Program terminated successfully.
Oh thank the gods and devils both. It was successful.
But they were in his brain, now. Theoretically, he should be able to interact with them, if all had gone according to plan. He tried not to think about how unsanitary last night’s actions were, or rather to think about that instead of the presence of so much non-biological material now swarming around in his skull. He could feel the crawling- the sensation of parasites under his skin, but how much of that was simply psychological? He couldn’t say.
“Not nearly enough time to run any sort of experiment on them,” he sighed; class would begin soon. Sure, he was already in the building, but still. “How disappointing. How many are left in there?” He finally bothered to stand up and check the box; maybe if he… tried to input commands to those ones? There were still plenty in there; doesn’t take that much metal to make a computer chip inside one’s head then.
They stirred, sluggish and confused. They had never moved of their own accord before... Responsive? Again, move again- and they did. They swayed with little ripples, ocean waves almost.
Link to human minds- [Y].
#[[ Human they Say | Oliver Hoshino ]]#byteverse#caffeinated writing#if people want more then they gotta tell me because I will assume 'no' unless told 'yes'#I don't know why it's not tagging byte but
3 notes
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment in our series on using Python to speed up SEO traffic recovery. In part one, I explained how our unique approach, which we call "winners vs losers," helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach by manually grouping pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, as is typically the case with ecommerce sites. In part three, we will learn something really exciting: we will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in parts one, two, and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact that the URL groups had clear patterns (collections, products, and the others), but it is often the case that there are no patterns in the URL. For example, Yahoo Stores' sites use a flat URL structure with no directory paths. Our manual approach wouldn't work in this case.
Fortunately, it is possible to group pages by their content, because most page templates have different content structures. They serve different user needs, so their structures need to differ.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
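As a quick illustration (not from the original post), such an XPath presence check can be scripted with requests and lxml; the selector string below is a made-up example, since the real one depends entirely on the site's template.
import requests
from lxml import html

# Hypothetical XPath for a product hero image, copied from Chrome's inspector.
PRODUCT_IMAGE_XPATH = '//div[@class="product-main"]//img[@id="hero-image"]'

def looks_like_product_page(url):
    # Fetch and parse the page's DOM.
    tree = html.fromstring(requests.get(url, timeout=10).text)
    # If the selector matches at least one node, treat the page as a product detail page.
    return len(tree.xpath(PRODUCT_IMAGE_XPATH)) > 0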
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected and cleaned features from the HTML tags, like class IDs, CSS style sheet names, and so on. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements, as my feature set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and more effective approach for this specific problem. All of this is possible because I didn't start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let's get to the fun part and write some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
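A rough illustration of that loading step with pandas (the file names below are placeholders, not the article's actual data source):
import pandas as pd

# Placeholder file names; the companion notebook/gist defines the real data source.
form_counts = pd.read_csv("form_counts.csv")  # one row per URL: url, form_count, input_count
img_counts = pd.read_csv("img_counts.csv")    # one row per image: url, file_size, height, width

print(form_counts.head())
print(img_counts.head())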
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are likely to have multiple images, so there are many rows corresponding to each URL.
It is often the case that HTML documents don't include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which is roughly proportional to the product of the width and height of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases, it generally implies improvement, but we don't want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
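A minimal sketch of that preprocessing, assuming the img_counts and form_counts frames above (column names are illustrative):
import pandas as pd

# Bucket image file sizes into 50 bins, then one-hot encode the bins.
img_counts["size_bin"] = pd.cut(img_counts["file_size"], bins=50)
img_dummies = pd.get_dummies(img_counts["size_bin"], prefix="img_size")
img_dummies["url"] = img_counts["url"].values

# Collapse to one row per URL: how many of the page's images fall in each size bucket.
img_features = img_dummies.groupby("url").sum()

# Join with the per-URL form/input element counts to get the final feature matrix.
features = form_counts.set_index("url").join(img_features, how="left").fillna(0)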
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data and test it on another set that it's never seen before. We do this to prevent our model from simply "memorizing" the training data and doing terribly on new, unseen data. You can check out the full code in the Google Colab notebook mentioned above.
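As an illustrative sketch of that split (assuming scikit-learn, the features frame built earlier, and a labels series from the regex approach; both variable names are assumptions):
from sklearn.model_selection import train_test_split

# features: one row per URL; labels: the page-group label from the part-two regex approach (assumed).
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels)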
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We're using the well-known scikit-learn Python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikit-learn will run through all of them to find the best one; we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, call the .fit() method and voila!
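A sketch of such a grid search; the exact model list and hyperparameter grids in the original notebook may differ:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Candidate models with small illustrative hyperparameter grids.
candidates = {
    "linear_svm": (LinearSVC(), {"C": [0.1, 1, 10]}),
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)  # the training split from above
    best[name] = search.best_estimator_
    print(name, round(search.best_score_, 3))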
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974), with Logistic Regression (0.968) coming in a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn't make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well labeling products, but terribly at labeling pages that are neither products nor categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
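(A sketch with scikit-learn, assuming the winning estimator and the test split from the sketches above; the article's own snippet lives in the Colab notebook.)
from sklearn.metrics import confusion_matrix
import pandas as pd

predictions = best["linear_svm"].predict(X_test)
groups = sorted(set(y_test))                       # page-group names in a fixed order
cm = confusion_matrix(y_test, predictions, labels=groups)
print(pd.DataFrame(cm, index=groups, columns=groups))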
Finally, here is the code to plot the model evaluation:
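(Again only a sketch, rendering the matrix above as a matplotlib heatmap.)
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cm, cmap="Blues")          # darker squares = more predictions in that cell
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups, rotation=45, ha="right")
ax.set_yticks(range(len(groups)))
ax.set_yticklabels(groups)
ax.set_xlabel("Predicted group")
ax.set_ylabel("Actual group")
fig.colorbar(im)
plt.tight_layout()
plt.show()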
Resources to learn more
You might be thinking that this is a lot of work just to tell page groups apart, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a PyData event. I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
from Digital Marketing News https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/
2 notes
Photo
From crawl completion notifications to automated reporting: this post may not have Billy the Kid or Butch Cassidy, but instead here are a few of my most useful tools to combine with the SEO Spider (just as exciting). We SEOs are extremely lucky—not just because we're working in such an engaging and collaborative industry, but because we have access to a plethora of online resources, conferences and SEO-based tools to lend a hand with almost any task you could think up. My favourite of which is, of course, the SEO Spider—after all, following Minesweeper and Outlook, it's likely the most used program on my work PC. However, a great program can only be made even more useful when combined with a gang of other fantastic tools to enhance, complement or adapt the already vast and growing feature set. While it isn't quite the ragtag group from John Sturges' 1960 cult classic, I've compiled the Magnificent Seven(ish) SEO tools I find useful to use in conjunction with the SEO Spider:
Debugging in Chrome Developer Tools
Chrome is the definitive king of browsers, and arguably one of the most installed programs on the planet. What's more, it's got a full suite of free developer tools built straight in—to load it up, just right-click on any page and hit inspect. Among many aspects, this is particularly handy to confirm or debunk what might be happening in your crawl versus what you see in a browser. For instance, while the Spider does check response headers during a crawl, maybe you just want to dig a bit deeper and view it as a whole? Well, just go to the Network tab, select a request and open the Headers sub-tab for all the juicy details.
Perhaps you've loaded a crawl that's only returning one or two results and you think JavaScript might be the issue? Well, just hit the three dots (highlighted above) in the top right corner, then click settings > debugger > disable JavaScript and refresh your page to see how it looks.
Or maybe you just want to compare your nice browser-rendered HTML to that served back to the Spider? Just open the Spider and enable 'JavaScript Rendering' & 'Store Rendered HTML' in the configuration options (Configuration > Spider > Rendering/Advanced), then run your crawl. Once complete, you can view the rendered HTML in the bottom 'View Source' tab and compare it with the rendered HTML in the 'elements' tab of Chrome. There are honestly far too many options in the Chrome developer toolset to list here, but it's certainly worth getting your head around.
Page Validation with a Right-Click
Okay, I'm cheating a bit here as this isn't one tool, rather a collection of several, but have you ever tried right-clicking a URL within the Spider? Well, if not, I'd recommend giving it a go—on top of some handy exports like the crawl path report and visualisations, there's a ton of options to open that URL in several individual analysis & validation apps:
Google Cache – See how Google is caching and storing your pages' HTML.
Wayback Machine – Compare URL changes over time.
Other Domains on IP – See all domains registered to that IP Address.
Open Robots.txt – Look at a site's Robots.
HTML Validation with W3C – Double-check all HTML is valid.
PageSpeed Insights – Any areas to improve site speed?
Structured Data Tester – Check all on-page structured data.
Mobile-Friendly Tester – Are your pages mobile-friendly?
Rich Results Tester – Is the page eligible for rich results?
AMP Validator – Official AMP project validation test.
User Data and Link Metrics via API Access
We SEOs can't get enough data, it's genuinely all we crave – whether that's from user testing, keyword tracking or session information, we want it all and we want it now! After all, creating the perfect website for bots is one thing, but ultimately the aim of almost every site is to get more users to view and convert on the domain, so we need to view it from as many angles as possible.
Starting with users, there's practically no better insight into user behaviour than the raw data provided by both Google Search Console (GSC) and Google Analytics (GA), both of which help us make informed, data-driven decisions and recommendations. What's great about this is you can easily integrate any GA or GSC data straight into your crawl via the API Access menu so it's front and centre when reviewing any changes to your pages. Just head on over to Configuration > API Access > [your service of choice], connect to your account, configure your settings and you're good to go.
Another crucial area in SERP rankings is the perceived authority of each page in the eyes of search engines – a major aspect of which is (of course) links, links and more links. Any SEO will know you can't spend more than 5 minutes at BrightonSEO before someone brings up the subject of links; it's like the lifeblood of our industry. Whether their importance is dying out or not, there's no denying that they currently still hold much value within our perceptions of Google's algorithm. Well, alongside the previous user data you can also use the API Access menu to connect with some of the biggest tools in the industry, such as Moz, Ahrefs or Majestic, to analyse your backlink profile for every URL pulled in a crawl. For all the gory details on API Access check out the following page (scroll down for other connections): https://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/
Understanding Bot Behaviour with the Log File Analyzer
An often-overlooked exercise, nothing gives us quite the insight into how bots are interacting with a site like data taken directly from the server logs. The trouble is, these files can be messy and hard to analyse on their own, which is where our very own Log File Analyzer (LFA) comes into play (they didn't force me to add this one in, promise!). I'll leave @ScreamingFrog to go into all the gritty details on why this tool is so useful, but my personal favourite aspect is the 'Import URL data' tab on the far right. This little gem will effectively match any spreadsheet containing URL information with the bot data on those URLs. So, you can run a crawl in the Spider while connected to GA, GSC and a backlink app of your choice, pulling the respective data from each URL alongside the original crawl information. Then, export this into a spreadsheet before importing into the LFA to get a report combining metadata, session data, backlink data and bot data all in one comprehensive summary, aka the holy quadrilogy of technical SEO statistics. While the LFA is a paid tool, there's a free version if you want to give it a go.
Crawl Reporting in Google Data Studio
One of my favourite reports from the Spider is the simple but useful 'Crawl Overview' export (Reports > Crawl Overview), and if you mix this with the scheduling feature, you're able to create a simple crawl report every day, week, month or year. This allows you to monitor for any drastic changes to the domain, alerting you to anything which might be cause for concern between crawls.
However, in its native form it's not the easiest to compare between dates, which is where Google Sheets & Data Studio can come in to lend a hand. After a bit of setup, you can easily copy over the crawl overview into your master G-Sheet each time your scheduled crawl completes, then Data Studio will automatically update, letting you spend more time analysing changes and less time searching for them. This will require some fiddling to set up; however, at the end of this section I've included links to an example G-Sheet and Data Studio report that you're welcome to copy.
Essentially, you need a G-Sheet with date entries in one column and unique headings from the crawl overview report (or another) in the remaining columns. Once that's sorted, take your crawl overview report and copy out all the data in the 'Number of URI' column (column B), being sure to copy from 'Total URI Encountered' until the end of the column. Open your master G-Sheet and create a new date entry in column A (add this in a format of YYYYMMDD). Then in the adjacent cell, right-click > 'Paste special' > 'Paste transposed' (Data Studio prefers this to long-form data). If done correctly with several entries of data, you should have something like this:
Once the data is in a G-Sheet, uploading this to Data Studio is simple: just create a new report > add data source > connect to G-Sheets > [your master sheet] > [sheet page] and make sure all the heading entries are set as a metric (blue) while the date is set as a dimension (green), like this:
You can then build out a report to display your crawl data in whatever format you like. This can include scorecards and tables for individual time periods, or trend graphs to compare crawl stats over the date range provided (your very own Search Console Coverage report). Here's an overview report I quickly put together as an example. You can obviously do something much more comprehensive than this should you wish, or perhaps take this concept and combine it with even more reports and exports from the Spider. If you'd like a copy of both my G-Sheet and Data Studio report, feel free to take them from here:
Master Crawl Overview G-Sheet: https://docs.google.com/spreadsheets/d/1FnfN8VxlWrCYuo2gcSj0qJoOSbIfj7bT9ZJgr2pQcs4/edit?usp=sharing
Crawl Overview Data Studio Report: https://datastudio.google.com/open/1Luv7dBnkqyRj11vLEb9lwI8LfAd0b9Bm
Note: if you take a copy, some of the dimension formats may change within Data Studio (breaking the graphs), so it's worth checking the date dimension is still set to 'Date (YYYYMMDD)'.
Building Functions & Strings with XPath Helper & Regex Search
The Spider is capable of doing some very cool stuff with the extraction feature, a lot of which is listed in our guide to web scraping and extraction. The trouble with much of this is it will require you to build your own XPath or regex string to lift your intended information. While simply right-clicking > Copy XPath within the inspect window will usually do enough to scrape, it's not always going to cut it for some types of data. This is where two Chrome extensions, XPath Helper & Regex Search, come in useful. Unfortunately, these won't automatically build any strings or functions, but if you combine them with a cheat sheet and some trial and error you can easily build one out in Chrome before copying into the Spider to run in bulk across all your pages. For example, say I wanted to get all the dates and author information of every article on our blog subfolder (https://www.screamingfrog.co.uk/blog/).
If you simply right-clicked on one of the highlighted elements in the inspect window and hit Copy > Copy XPath, you would be given something like:
/html/body/div[4]/div/div[1]/div/div[1]/div/div[1]/p
While this does the trick, it will only pull the single instance copied ('16 January, 2019 by Ben Fuller'). Instead, we want all the dates and authors from the /blog subfolder. By looking at what elements the reference is sitting in, we can slowly build out an XPath function directly in XPath Helper and see what it highlights in Chrome. For instance, we can see it sits in a class of 'main-blog--posts_single-inner--text--inner clearfix', so pop that as a function into XPath Helper:
//div[@class="main-blog--posts_single-inner--text--inner clearfix"]
XPath Helper will then highlight the matching results in Chrome. Close, but this is also pulling the post titles, so not quite what we're after. It looks like the date and author names are sitting in a sub <p> tag, so let's add that into our function:
(//div[@class="main-blog--posts_single-inner--text--inner clearfix"])/p
Bingo! Stick that in the custom extraction feature of the Spider (Configuration > Custom > Extraction), upload your list of pages, and watch the results pour in! Regex Search works much in the same way: simply start writing your string, hit next and you can visually see what it's matching as you go. Once you've got it, whack it in the Spider, upload your URLs, then sit back and relax.
Notifications & Auto Mailing Exports with Zapier
Zapier brings together all kinds of web apps, letting them communicate and work with one another when they might not otherwise be able to. It works by having an action in one app set as a trigger and another app set to perform an action as a result. To make things even better, it works natively with a ton of applications such as G-Suite, Dropbox, Slack, and Trello. Unfortunately, as the Spider is a desktop app, we can't directly connect it with Zapier. However, with a bit of tinkering, we can still make use of its functionality to provide email notifications or auto-mail reports/exports to yourself and a list of predetermined contacts whenever a scheduled crawl completes.
All you need is to have your machine or server set up with an auto cloud sync directory such as those on 'Dropbox', 'OneDrive' or 'Google Backup & Sync'. Inside this directory, create a folder to save all your crawl exports & reports. In this instance, I'm using G-Drive, but others should work just as well. You'll need to set a scheduled crawl in the Spider (File > Schedule) to export any tabs, bulk exports or reports into a timestamped folder within this auto-synced directory.
Log into or create an account for Zapier and make a new 'zap' to email yourself or a list of contacts whenever a new folder is generated within the synced directory you selected in the previous step. You'll have to provide Zapier access to both your G-Drive & Gmail for this to work (do so at your own risk). My zap looks something like this:
The above zap will trigger when a new folder is added to /Scheduled Crawls/ in my G-Drive account. It will then send out an email from my Gmail to myself and any other contacts, notifying them and attaching a direct link to the newly added folder and Spider exports. I'd like to note here that if running a large crawl or directly saving the crawl file to G-Drive, you'll need enough storage to upload (so I'd stick to exports). You'll also have to wait until the sync is completed from your desktop to the cloud before the zap will trigger, and it checks this action on a cycle of 15 minutes, so it might not be instantaneous.
Alternatively, do the same thing on IFTTT (If This Then That) but set it so a new G-Drive file will ping your phone, turn your smart light a hue of lime green or just play this sound at full volume on your smart speaker. We really are living in the future now!
Conclusion
There you have it, the Magnificent Seven(ish) tools to try using with the SEO Spider, combined to form the deadliest gang in the west web. Hopefully you find some of these useful, but I'd love to hear if you have any other suggestions to add to the list.
The post SEO Spider Companion Tools, Aka 'The Magnificent Seven' appeared first on Screaming Frog.
0 notes
Text
Superalgos, Part Two: Building a Trading Supermind
Superalgos: Building a Trading Supermind
A supermind of humans and machines, thinking and working together and doing everything required to maximize the group's collective intelligence so as to minimize the time needed for superalgos to emerge, is being built right now.
This article is the second half of a two-piece introduction to Superalgos. We highly recommend reading Part One before diving under the hood!
By Luis Molina & Julian Molina
Four trading bots at a 1-day long test competition over ROI.
The Rabbit Hole
It all started as I woke up one day by mid-summer, 2017, the Danube as the backdrop for what would turn into my first serious trading adventure.
Being deep into bitcoin since 2013, I had been fooling around trading crypto as a hobby without any real knowledge for years.
Having finalized my commitments with my latest project, I found myself with time to spare, some bitcoin, and an urge to give trading a serious try.
Earlier that year, I had been spending more time than usual doing manual trading and educating myself on the basics. I had fallen for the illusion that it was quite easy to accumulate profits. I was ready to give it a try at automating the primitive strategies I had come up with.
Little did I know I had actually been surfing the start of the biggest bull wave the markets had ever witnessed…
Even Grandma was making money then!
I had spent some time checking out a few of the trading platforms of the moment and several open source projects. It all seemed clunky, overly complex or constrained to the typical approaches to trading.
I had something else in mind.
Swarms of tourists roamed Budapest.
The sun burnt, the gentle spring long gone.
The crisp morning air screamed through my bedroom window… “get out… it’s awesome out here”.
Nonetheless, I knew this restlessness would not subside. At 44, having spent my life imagining and creating the wildest projects, I knew this feeling well.
I was doomed. I had to give it a shot or I wouldn’t stand myself.
I knew I risked getting sucked into a multi-year adventure to which I would give my 100%, 24/7, until I was satisfied. Ideas have that effect on obsessive characters.
I crawled out of bed and sat at the keyboard.
By the end of the day, I was deep down the rabbit hole. Falling.
The Canvas App
Screenshot of the Canvas App
I had envisioned a platform that would provide a visual environment in which I would be able to browse data in a unique way. Besides the typical candlestick charts and indicator functions, I wanted bots to overlay their activities and decision-making clues in a graphic fashion, directly over the charts.
That would allow me to debug strategies visually.
Although exchanges provide online and historical data, and because I wanted full control over visualization capabilities, I knew I needed to extract data from exchanges and store it within the system, so that data could be loaded in very specific manners.
By that time, Poloniex was one of the largest crypto exchanges out there, so I started with them. Still, I knew from day one this platform would be exchange agnostic.
It took me several months and many iterations to craft a process that would extract historical data from Poloniex in a way that guaranteed — no matter what happened to the process — that the resulting data set would be reliable, with no errors, holes or missing parts — ever.
The raw data available at exchanges represents trades. With trades, candles and volumes can be calculated, and with those a constellation of other data structures, usually known as indicators.
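As an illustration of that step (a sketch in Python only; the column names and the one-minute interval are assumptions, not the platform's actual schema), raw trades can be rolled up into candles with pandas:
import pandas as pd

# A handful of fake trades: timestamp, price and amount per trade.
trades = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-07-01 00:00:05", "2017-07-01 00:00:40",
                                 "2017-07-01 00:01:10", "2017-07-01 00:01:55"]),
    "price": [2500.0, 2502.5, 2501.0, 2498.7],
    "amount": [0.10, 0.25, 0.05, 0.40],
}).set_index("timestamp")

# Aggregate raw trades into 1-minute OHLC candles plus traded volume.
candles = trades["price"].resample("1min").ohlc()
candles["volume"] = trades["amount"].resample("1min").sum()
print(candles)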
I realized early on that the amount of data to manage was quite big. The USDT/BTC market in Poloniex spanning four years of history at that time consisted of more than 300 million trade records.
Even when grouped into candles, it was a lot of data.
While I was polishing the extraction process, I was also programming the graphical environment to plot the data. I called this part of the platform the Canvas App, as it uses an HTML5 canvas object to render the graphics.
By that time I knew what I was doing was too big for a single developer. I also knew it would be ridiculous to build such a platform for my personal use only.
I was ready to go open source.
I showed what I had to a few trader friends, one of which — Andreja — would eventually join the project.
I also showed it to Julian, my brother, an entrepreneur himself, always eager for a good adventure, and co-author of this series of articles, who would also join later on.
The feedback was varied.
Everyone had their own views.
That led me to believe that I should not constrain the platform to my own personal taste and that I should make it as open as possible. I knew how I wanted to visualize data, but I didn't know how other people might want to do it.
That also meant that the platform needed to enable the visualization and navigation through data sets independently of what they were, what information they contained, or how information would be plotted.
Well, that was a big problem to solve.
How do I create a tool that is able to visualize huge data sets when I don’t know how data sets will be structured or how people will want to graphically represent them on the charts?
Google Maps
Google Maps allows navigation of huge datasets in an amazingly intuitive way.
The inspiration came from Google Maps.
It too has a number of huge data sets, with different layers of different information.
The app allows users to smoothly navigate from one side of the globe to the other just by dragging the map.
Zooming in and out dynamically loads new layers containing more detailed information.
It supports multiple plugins (traffic, street view, etc) and runs them within the same interface.
It has an API that allows embedding the app on any web page.
These were all great features and I wanted them all in the Canvas App.
It took Google several years — and a few acquisitions — to evolve the app into what it is today: one of the most powerful collections of geospatial tools.
In my case I had to figure out which were the common denominators in all data sets in order to give some structure to the user interface. It seemed that most data sets were organized over time.
This is not strictly true in all cases, but it was a good enough approximation to start with.
So the UI needed to manage the concept of time, which meant that instead of navigating locations as in Google Maps, users would navigate through time, back and forth.
Time would be plotted on the X axis — usually, the longest in the screen.
Zooming in and out would have the effect of contracting and expanding time, not space.
The Y axis was reserved for plotting different markets at different exchanges. I wanted users to be able to easily browse and visually compare markets side by side.
Since the platform should not constrain the way the user would visualize data, I introduced the concept of a plotter — a piece of code that renders a graphical representation of a certain type of data set, which is separate from the Canvas App.
From the Canvas App point of view, a plotter is like a plug-in.
People would be able to create plotters to render their own data sets. And, why not, to be used by others on data sets of the same type.
It took me several 100% code rewrites to bring all this to reality.
Canvas App v0.1, checked.
Of course, to test all these concepts, I needed proper data sets that I could plot in the canvas app. Raw trades are, well, insipid?
So, at the same time I was working on the Canvas App, I was developing the first few elaborate data sets: candles, volumes, and others…
It was December, 2017.
Five months free falling.
The Core Team
Conversations with my brother Julian, a partner in many previous ventures, had become more and more frequent.
We had been toying with ideas around the gamification of trading, specialization of bots, bots’ autonomy and a bots’ marketplace.
By December 2017, Julian put to rest everything else he was doing and started working on the project full time.
It took us two months and half a dozen rewrites of a Project Narrative document to distil most of the concepts explained in Part I of this series of articles.
Our friend Andreja, a former hedge fund manager in Switzerland and former pro trader with Goldman Sachs, was getting excited about it all and started offering valuable expert feedback.
In the meantime, before the end of 2017 I made a trip to my hometown, Cordoba, Argentina.
As usual, I got invited to speak at meetups to share my latest adventures in the crypto world. Ten minutes from wrapping up one of the talks, someone asked what I was doing at the moment.
I realised I hadn’t publicly discussed the project before.
I wasn't ready, but I gave it a shot. Those final 10 minutes were the most exciting for the audience.
Matias, a senior developer for an American car manufacturer approached me with interest to help in the project. Six months later he would quit his job to work with us full time.
Back in Budapest, I asked my friend Daniel Jeffries for some help regarding trading strategies. He recommended implementing a simple trending strategy based on LRC, short for Linear Regression Curves.
Matias started developing the LRC indicator and its corresponding plotter along with the trading bot that would use it.
In many trading platforms, indicators are math functions that come in libraries, ready to use. In our case, this didn't fit the way we wanted to visualize data sets, a la Google Maps.
Google Maps dynamically loads the data around the location the user is in, considering the zoom level. If the user moves, it will download more data, on demand. If the internet connection is slow or the user moves too fast, the worst case scenario is that the user might see parts of the map pixelated or with information appearing at different zoom intervals. It is clearly a very efficient system.
In our platform the experience is similar, only that the Canvas App is still in its infancy and doesn’t have the iterations required to remove all the rough edges.
Indicators may require data close to the user’s position in the timeline or very far from it, or maybe even all of it.
This fact made it impossible to use indicators as functions as it would be impossible to render them in Google Maps style. That would require loading big chunks of data and calculating the indicator on the fly correctly. Every little move in the timeline would require more online calculations after loading the data.
Now, this may not sound like a problem if you are thinking of one bot or one indicator.
However, we knew we wanted to be able to plot large volumes of data in a multi-user environment, with tens of bots in the screen concurrently, and still enjoy the Google Maps navigational characteristics.
The solution to this problem was to precalculate the indicators in the cloud, and output a new data set with the results of those calculations. That would allow the platform to load only small chunks of indicator data around the user's position on the timeline.
Matias experienced this first hand, as calculating Linear Regression Curves requires analyzing information beyond the current position.
His first implementation was through a function, but he later switched to a process that would have candles as an input and data points of the curve as an output.
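To illustrate the idea (a sketch in Python, not the platform's actual code; the window size and column names are assumptions for illustration):
import numpy as np
import pandas as pd

def lrc_points(candles: pd.DataFrame, window: int = 30) -> pd.Series:
    # Precompute Linear Regression Curve points over the candles' closing prices.
    closes = candles["close"]
    x = np.arange(window)
    points = []
    for end in range(window, len(closes) + 1):
        y = closes.iloc[end - window:end].to_numpy()
        slope, intercept = np.polyfit(x, y, 1)           # straight-line fit over the window
        points.append(slope * (window - 1) + intercept)  # fitted value at the window's last candle
    return pd.Series(points, index=closes.index[window - 1:])

# The resulting series would be written out as its own data set, so the Canvas App can
# load just the small slice around the user's position on the timeline.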
By March 2018 we were ready to start working on some front end stuff so Julian invited Barry, a graphic designer and front end dev based in California we knew from previous projects. After some exploration time toying with branding elements and some coding, he closed shop for his freelance clients and started working with us full time too.
Andreja finally entered the core team in the second half of 2018 bringing his friend Nikola — at 26, the youngest member of the core team, and — most likely — the strongest developer.
Financial Beings
Image © Who is Danny, Shutterstock.
Trading bots are — at their core — algorithms that decide when to buy, or when to sell the assets under management based on a certain strategy. Evolution would be the mechanism to increase their intelligence.
At some point in time we came to understand that there were other kinds of computer programs in our ecosystem that would benefit from the paradigm of financial life.
Take, for example, the processes which extract data from exchanges.
Or the processes which digest data sets to produce other data sets — indicators.
Should they be financial beings too?
And what about the plotters used to render data sets in a graphical environment?
Becoming financial beings would confer on these entities all the virtues entailed in the evolutionary model. They too would improve with time.
Note that this does not mean these entities need to compete in trading competitions. They do not trade. But still, they would compete to offer quality services in a marketplace — a data marketplace.
This internal data marketplace is a fundamental piece of the ecosystem. Why? Because quality data is a crucial aspect of the decision-making process of trading algorithms.
It is important that data organisms evolve together with trading organisms.
As a result, we expanded the evolutionary model to include sensors who retrieve information from external sources like exchanges, indicators who process sensors’ or other indicators’ data to create more elaborate data sets, plotters who render data sets on the canvas, and traders who decide when to buy or sell.
We call all these bots Financial Beings (FB).
By definition, a FB is an algorithm that features financial life, which means they have an income and expenses, and require a positive balance of ALGO tokens to be kept alive. Being kept alive means being executed at a specific time interval so that the algorithm may run and do whatever it is programmed to do.
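As a toy illustration of that definition (the class, cost and names below are hypothetical, not the platform’s API), the keep-alive rule amounts to charging each execution interval against the being’s ALGO balance:

```python
TIME_TO_LIVE_COST = 1.0  # hypothetical ALGO expense charged per execution interval

class FinancialBeing:
    """Toy model of financial life: run only while the ALGO balance stays positive."""

    def __init__(self, algo_balance, run):
        self.algo_balance = algo_balance  # income minus expenses, in ALGO
        self.run = run                    # whatever the algorithm is programmed to do

    def tick(self):
        """Execute one interval if the being can pay to stay alive."""
        if self.algo_balance < TIME_TO_LIVE_COST:
            return False                  # balance exhausted: the being dies
        self.algo_balance -= TIME_TO_LIVE_COST
        self.run()
        return True

bot = FinancialBeing(algo_balance=3.0, run=lambda: print("executing strategy"))
while bot.tick():
    pass  # in the platform, execution would be scheduled at a fixed time interval
```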
Previously, we stated that trading algorithms participate in trading competitions, get exposure according to competition rankings, and the crowd may pick them for breeding — by forking them. Later on, asset holders may rent them to trade on their behalf.
But what about the others?
How do sensors, indicators and plotters lead a financial life?
Sensors and indicators will sell subscriptions to their data sets, mainly to trading bots, but also to other indicators. That will be their source of income. There will be an inter-bot economy within the ecosystem — the data marketplace.
The market will decide what data is worth paying for; therefore, only useful indicators will survive.
Plotters render data sets on the screen. The crowd of traders and developers in the community may choose to subscribe to plotter services, particularly those that offer a superior visual experience.
The part of the system involving the transfer of value between bots is not implemented yet. We are very much aware that it will be challenging to build a balanced model that works well in every sense. Our analysis on that front is still shallow, but we are looking forward to getting to work on those issues.
So, now that all these entities had been recognized, the next question in line was: where was each of these FBs going to run?
The Browser & the Cloud
I always thought that in order to debug trading strategies properly, the debugging tool should be tightly coupled with the graphical user interface, the one plotting the data.
It is tremendously easier for humans to identify patterns in graphical representations of data than in hard numbers.
The same should be true for debugging the behaviour of trading bots.
If we plotted everything a trading bot did directly over the timeline, including all relevant information about why it did what it did, it would be a powerful tool for later identifying why it didn’t act as expected, or how its current behaviour might be improved.
To me, it was clear from the start that the first versions of bots should be able to run in the browser, directly in the platform’s Canvas App. Also, when not debugging but running live or competing with other bots, they should run in the cloud.
It took me a while — and a few rewrites — to put the pieces together and create the software that would act as the execution environment of bots in the cloud — what I called the Cloud App — and that, at the same time and without any changes at the bot source code level, could run bots in the browser too.
That goal was achieved by mid-2018.
Since then, the exact same bot source code can trade in debug mode from the browser and be executed in the cloud in live, backtesting or competition modes.
Developers are now able to use the browser developer tools to debug a bot’s code while viewing the real-time plotting of the bot’s activities in the Canvas App.
This setup effectively lowers entry barriers for developers.
There is no environment to install.
No setups.
No configurations.
No requirements on developers’ machines.
It takes one minute to sign up.
Right after sign up, the system provides the user with a fork of the latest trading champion.
Developers may have a bot up and running in — literally — two minutes.
Of course, all these advantages have a trade off.
The main one is probably that in order to run bots at the browser they need to be written in JavaScript, which is not the most popular language to develop trading bots.
This trade-off seemed reasonable, mainly because at this early stage, with a minimum viable product, we intend to bootstrap the community with amateur traders and developers willing to experiment, have fun and compete.
Pro traders are — of course — very much welcome too. They too may find the proposition fun and interesting. But we do not really expect them to bring in their top-secret Python bots to be open sourced.
With time, the community will demand other languages to be supported and that demand will drive the evolution of the platform.
In a similar way, we foresee that a segment of the community may tend to professionalize, and different interests will drive the platform to evolve in different directions.
The Assistant
Because trading bots are meant to compete with each other and eventually service clients, several issues arise.
How do we guarantee a fair competition?
How do we ensure that bots do not cheat in terms of what they did and achieved?
Also, how do we assure future clients that the asset management services they will be subscribing to have the track record they claim to have?
The logical thing to do is to forbid trading bots from placing orders on exchanges by themselves. Instead, we offer a common and transparent infrastructure to do it on demand.
The platform keeps track of all operations bots perform at the exchange — usually buy and sell orders — and monitors the exchange to track orders that get filled — both in full or partially.
We call this piece of the platform The Assistant, since it assists trading bots with all interactions with exchanges.
The Assistant solves the problem of trading bots cheating about their performance.
It is the Assistant which, before each trading bot execution, checks the account at the exchange, verifies which orders were filled, keeps a balance for each asset in the pair, keeps a running available balance — that is, the funds not committed in unfilled orders — and exposes the methods trading bots can call when they decide to buy or sell.
In addition, the Assistant calculates each trading bot’s ROI and saves that information, which becomes available to the rest of the system.
The features packed into the Assistant guarantee a fair competition, since trading bots have no control over the tracking the Assistant does. This means that two or more competing bots play on a level field, bound by the same rules.
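As a conceptual sketch of that bookkeeping (invented names, Python purely for illustration; the real Assistant is part of the JavaScript platform and talks to live exchange APIs):

```python
class Assistant:
    """Toy bookkeeping for one trading bot on one pair (e.g. BTC/USDT)."""

    def __init__(self, quote_balance):
        self.balances = {"base": 0.0, "quote": quote_balance}
        self.available = dict(self.balances)  # funds not committed in unfilled orders
        self.open_orders = []
        self.initial_quote = quote_balance

    def place_buy_order(self, amount_quote, price):
        """Method exposed to the bot; the Assistant, not the bot, talks to the exchange."""
        if amount_quote > self.available["quote"]:
            raise ValueError("insufficient available balance")
        self.available["quote"] -= amount_quote  # commit the funds
        order = {"side": "buy", "amount_quote": amount_quote, "price": price}
        self.open_orders.append(order)
        return order

    def on_fill(self, order, fraction=1.0):
        """Called when monitoring detects the order was filled, in full or partially."""
        filled_quote = order["amount_quote"] * fraction
        bought_base = filled_quote / order["price"]
        self.balances["quote"] -= filled_quote
        self.balances["base"] += bought_base
        self.available["base"] += bought_base

    def roi(self, last_price):
        equity = self.balances["quote"] + self.balances["base"] * last_price
        return equity / self.initial_quote - 1
```

Before each bot execution, the real Assistant would refresh this state against the exchange; the bot itself only ever sees the exposed methods.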
Scalability
Docker + Kubernetes = High Scalability.
Sometime along the way we got feedback from a hardcore trading algos aficionado: he was worried the platform would not scale to run thousands of bots.
He was right and he was wrong at the same time.
He was right because, at that stage, any audit of the code and the infrastructure would have concluded neither scaled.
But he was equally wrong.
He missed the crucial point that the platform itself — like most open source projects — was subject to evolution.
He also failed to understand that what he was reviewing at that point in time was merely the solution to the first handful of problems — from a list of hundreds — to be solved for superalgos to emerge.
There is no technical problem that time and brain power cannot solve.
The key takeaway is that superalgos will emerge from a global supermind.
Whatever needs to be done for superalgos to emerge, will be done.
It took Matias the last quarter of 2018 to containerize the execution environment in the cloud.
By doing so, he not only solved the scalability problem, but also a few other issues found hiding in the attic…
How do we allow bot creators to run bots by themselves?
How do we limit accessibility so that evil bots cannot interfere with other bots or mess with the platform itself?
The solution came in the form of Kubernetes, open source software capable of orchestrating the deployment of Docker containers, designed precisely to solve these problems.
Probably in the spirit of emphasizing that the platform is subject to the dynamics of evolution too, Nikola preaches the theory that the platform itself should be considered a type of bot, or at least some parts of it, like the Assistant.
System Modules
The system is divided into modules. We expect to build at least 50 modules in the next few years. | Image © D1SK, Shutterstock.
The second half of 2018 saw us working on a number of issues further down the list…
If the crowd were to organize itself into teams, people would need to sign up, manage their profiles, discover existing teams, and apply to join or create new ones…
The Canvas App was not the right place to manage all this.
We needed a web based system.
We anticipated it would be rather big, and it too would be subject to evolution.
Once again, the mission is to do everything required to have superalgos emerge. If we accept that no one has the definitive recipe, we need to build flexible solutions capable of evolving.
This reasoning pointed us in the direction of a modular system.
The most urgent was the Users Module. A piece of software responsible for managing everything related to users. It would allow users to discover and search for other users, as well as create and manage user profiles. The module would readily accept more functionality related to users as demand evolves.
I developed this module myself.
Second in line, was the Teams Module, developed by Barry. Its scope is everything related to teams. It allows users to explore and find existing teams, apply to become members, create teams, define admins, invite users to join and manage membership applications. It too is ready to evolve within that scope.
Trading bots need API keys to access the exchange and perform trades. The keys are obtained from the exchange and need to be imported into our system so that they are available when needed.
To solve this requirement, Matias developed the Key Vault Module, which is responsible for allowing users to manage their exchange API keys. That includes defining which bot will use each key. Needless to say, all key handling must be done in a secure way.
We also needed a module to define and manage competition events, so Nikola developed the Events Module. It allows browsing upcoming, active and past competition events, as well as creating competitions with their own set of rules, prizes and more.
We already envision over 50 more modules, each one with a specific responsibility and scope.
But before building more modules, we needed a framework capable of running these modules in a coordinated manner, providing common navigation and other shared functionality.
The Master App
System modules’ front ends are transformed into NPM JavaScript libraries, which are listed as dependencies of the Master App so that only one web app is loaded in the browser. By the end of 2018 this framework was up and running, with the first set of modules in it.
Implementation wise, the Master App is a React / Redux app, which accesses a GraphQL backend API via Apollo. The Master App API is in fact an API Router, which forwards incoming requests to the backend of each individual module.
A module backend consists of a Node.js / Express app, with either MongoDB or PostgreSQL databases depending on the preferences of each individual developer.
Integrating the Canvas App, the Cloud App and the Master App, represented a significant milestone. By then we had all initial pieces in place and were ready to start adding more, while evolving the existing ones.
The system is untested and thus in alpha stage.
Code quality varies a lot.
How much work goes into cleaning up and hardening the code of each part of the system depends on our current perception of its odds of surviving evolution.
As the core team grew and ideas started being discussed among a larger group of people, the initial set of concepts got refined and extended.
Algobots
ROI of two Algobots as seen on the Canvas App.
Trading algorithms may be simple or complex. Maybe complex ones can be disassembled into many simpler ones.
What would be the simplest kind of trading algorithm, the equivalent to a unicellular being in biology?
The basic trading algorithms are the ones that implement a single strategy. The ones that do one thing well. In many cases this could mean the strategy is based on a single indicator.
We don’t need to impose any specific bound on their intelligence. We just assume that — like a unicellular organism — such a bot is self-sufficient and stays alive on its own merits, as long as it can obtain food from its surroundings.
We call this — the smallest unit of trading intelligence — an algobot.
Algonets
ROI of multiple Algobot clones controlled by a single Algonet. Notice how some of them decided to sell at a different moment, impacting their ROI.
We knew early on that algobots were not going to be alone in the trading space of bots running on the platform.
Let’s analyze algobots further…
What are an algobot’s main components?
The most important component is the source code. After that, the configuration, which provides many details needed at execution time. Third in line: the parameters, values set for each specific execution.
I’ll explain this with an example:
Let’s imagine an algobot called Mike.
Mike’s source code features the logic to decide when to buy and when to sell based on the LRC indicator described earlier. If Mike identifies 3 buying signals then it buys; if it identifies 2 selling signals, then it sells.
Numbers 3 and 2 on this example are arbitrary numbers. They may have been 4 and 3, for instance.
To clarify the difference between configuration and parameters, what we would find at Mike’s configuration file is the valid ranges for the number of buy and sell signals. Such ranges could be, for example, [2..6] and [3..7]. The actual parameters or values which Mike will use at any specific run are set right before it is put to run. These values could be 3 and 2 as proposed earlier, or 4 and 3, or any valid combination within the configured ranges.
Because parameters can be manipulated at will and any change may affect the behaviour of trading algorithms, we call parameters genes.
In the example above, we would say that Mike has two genes, each one with a range of possible values. The genes would be number of buy signals and number of sell signals.
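To make the configuration versus genes distinction concrete, a hypothetical configuration and one valid set of gene values for Mike could look roughly like this (the structure is illustrative, not the platform’s actual format):

```python
import random

# Configuration: the valid range for each gene, shipped alongside Mike's source code.
mike_configuration = {
    "genes": {
        "buy_signals":  {"min": 2, "max": 6},
        "sell_signals": {"min": 3, "max": 7},
    }
}

# Parameters (genes): concrete values chosen for one specific run,
# e.g. buy after 4 buying signals, sell after 3 selling signals.
mike_genes = {"buy_signals": 4, "sell_signals": 3}

def random_genes(configuration):
    """Pick a random, valid combination of gene values from the configured ranges."""
    return {name: random.randint(r["min"], r["max"])
            for name, r in configuration["genes"].items()}

print(random_genes(mike_configuration))
```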
An algonet is a different kind of trading organism; a higher order intelligence.
Algonets do not trade on their own. They use algobot clones to do the job. They dynamically create different clones of a single algobot, each with different values for their genes. Potentially, algonets may deploy as many algobot clones as there are gene value combinations.
Let’s imagine Mike Algonet — yes, algonets are constrained to one type of algobot only.
Mike Algonet is deployed with 1 million dollars in assets under management.
Initially, it deploys as many Mike Algobot clones as there are possible gene combinations, each with a unique set of gene values. Upon deployment, Mike Algonet assigns each clone a tiny amount of capital to trade with.
Our evolutionary model dictates that each clone needs ALGO tokens to stay alive, so, on top of capital to trade with, each clone receives a small amount of ALGO… whatever is required to live long enough to test their genes on the market!
Mike Algonet monitors the performance of each clone and dynamically decides which clones should be kept alive longer. Those are fed more ALGO.
When Mike Algonet figures out that any individual clone is not doing a good enough job, it simply stops feeding it, and the abandoned clone soon dies.
Once Mike Algonet is confident that a clone is performing well, it may place more capital under the selected clone’s management.
These are standard algonet mechanisms.
On top of that, algonets have their own source code so that developers can innovate in any way they see fit.
We have coded a simple version of an algonet, and experienced first hand how clones with different genes operate differently, as shown in the image above.
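Those standard mechanisms can be sketched in a few lines (a conceptual illustration with invented names, not an algonet’s actual source code):

```python
from itertools import product

# Hypothetical gene ranges for one algobot type (one range of valid values per gene).
gene_ranges = {"buy_signals": range(2, 7), "sell_signals": range(3, 8)}

def enumerate_clones(ranges):
    """Yield one set of gene values per potential algobot clone."""
    names = list(ranges)
    for combo in product(*ranges.values()):
        yield dict(zip(names, combo))

def deploy_algonet(ranges, capital, algo_tokens):
    """Spread a tiny amount of trading capital and ALGO across every possible clone."""
    clones = [{"genes": genes, "roi": 0.0} for genes in enumerate_clones(ranges)]
    for clone in clones:
        clone["capital"] = capital / len(clones)
        clone["algo"] = algo_tokens / len(clones)
    return clones

def rebalance(clones, keep=5):
    """Keep feeding ALGO (and eventually more capital) only to the best performers."""
    ranked = sorted(clones, key=lambda c: c["roi"], reverse=True)
    return ranked[:keep]  # the rest stop receiving ALGO and soon die

clones = deploy_algonet(gene_ranges, capital=1_000_000, algo_tokens=10_000)
print(len(clones), "clones deployed")  # 5 x 5 = 25 gene combinations
```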
Advanced Algos
Advanced algos control algonets which in turn control algobots
Now, imagine an entity of an even higher order, above algonets.
We call these advanced algos.
Advanced algos are capable of deploying clones of algonets. They may clone the same algonet multiple times with different parameters, or different algonets altogether, or any combination in between.
Like algonets, advanced algos use capital and ALGO to keep algonets working and alive, and dynamically decide how to allocate these two resources to obtain the best possible overall performance.
In turn, algonets working under advanced algos’ tutelage still control and operate their own algobots as explained earlier.
The same as with algonets and algobots, advanced algos too have source code for creators to innovate beyond the standard functionality all advanced algos have.
The advanced algos-algonet-algobot structure is analogous to the biological structure of organs-tissues-cells.
In biological life, cells work to keep themselves alive, but also work together with other cells to form tissues and adopt certain higher order functions. Something similar happens with multiple tissues forming organs. Even though organs are made of tissues, they have their own life, purpose, and mission.
After playing around long enough with this 3-layer hierarchy, the next set of questions emerged…
Are there even higher order entities with some well-defined purpose?
Will we have an equivalent of a complex biological organism? An entity that contains organ systems, organs, tissues and cells, and coordinates them all for a higher purpose?
Financial Organisms
A Financial Organism is made out of a hierarchy of nodes in which each node may control advanced algos, algonets or algobots as needed.
Financial Organisms are structures of bots that assemble themselves into hierarchies with no imposed rules or limits.
In the paradigm of advanced algos-algonets-algobots hierarchy described earlier, the individual decision to trade or not is up to each algobot clone. If any clone decides to buy, then it would do so using its own trading capital.
A Financial Organism (FO) is made up of nodes.
The first node in the hierarchy is called by the Assistant.
This node may in turn call other nodes, which may themselves call other nodes if needed.
All of these nodes are free to clone algobots, algonets or advanced algos and put them to run.
Nodes that put bots to run would effectively be impersonating the Assistant — remember it is normally the Assistant who calls bots. That means that when any of the clones wishes to place a buy or sell order, the command is received by the calling node, which can use that call as a signal for its own decision making process.
Put another way, nodes may use any set of trading bots as signalers.
Once a node arrives at a buy or sell decision, it places a buy or sell order upwards through the node hierarchy. The order is received and factored in by the node upstream through its own logic.
The process continues all the way up to the root node.
The ultimate buy and sell order is placed by the root node through the actual Assistant, which takes the order to the exchange.
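A toy sketch of that signal-aggregation idea, with invented class names and a simple threshold rule standing in for each node’s own logic:

```python
class Node:
    """A Financial Organism node that treats the bots (or nodes) below it as signalers."""

    def __init__(self, children, threshold=2):
        self.children = children    # child nodes, or bot clones the node impersonates
        self.threshold = threshold  # stand-in for the node's own decision logic

    def decide(self):
        signals = [child.decide() for child in self.children]  # "buy", "sell" or None
        if signals.count("buy") >= self.threshold:
            return "buy"
        if signals.count("sell") >= self.threshold:
            return "sell"
        return None

class BotSignal:
    """Stand-in for an algobot, algonet or advanced algo clone run by a node."""
    def __init__(self, decision):
        self.decision = decision
    def decide(self):
        return self.decision

# Two intermediate nodes, each watching a handful of bots, feeding the root node.
left = Node([BotSignal("buy"), BotSignal("buy"), BotSignal(None)])
right = Node([BotSignal("sell"), BotSignal("buy"), BotSignal("buy")])
root = Node([left, right])

# Only the root node's order reaches the real Assistant and, through it, the exchange.
print("root order:", root.decide())
```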
Mining ALGO
The system required to support the concepts described herein is big and complex.
Having all modules developed by a small group of developers does not seem to be the most efficient way to accomplish the mission.
It is not the most resilient approach either.
In terms of technology requirements, the unique characteristic of this project is that all of the modules are expected to evolve significantly. Moreover, the modules identified so far are certainly not all the modules that will be required for superalgos to emerge.
The project’s mission states we are to do “whatever it takes”, explicitly suggesting that we might need to do more than we can envision today.
Just like we expect a huge community of developers to breed bots, we also expect a sizeable part of the community to join the project’s development team and help build system modules.
ALGO tokens need to be issued and distributed among all supermind participants.
Because the key engine of evolution is competition, it was decided early on that 25% of the total supply of ALGO was to be reserved for competition prizes.
15% of ALGO would go to developers building the system.
10% would go to people working on business development.
50% would be reserved for financing and other incentives.
The mechanism to distribute the pool reserved for competitors was clear from the beginning: competition prizes would be distributed through the system among competition participants.
The reserved pool was not going to be distributed yet, so there was no need to establish a distribution mechanism for it.
That left us with the issue of defining the mechanism to distribute the 15% among system developers and the 10% among business developers.
We toyed with and iterated on several ideas, which resulted in the following requirement:
Whoever develops a system module should also maintain it.
The reason for that is that the system would quickly become too complex for a small core team to maintain. It would be manageable for a large extended team, though. In fact, it would work particularly well if each party developing a module remains responsible for maintaining it in the long run.
This means that both development and long term maintenance needed to be incentivized.
The solution we found mimics, in a way, a kind of crypto mining scheme.
Whoever takes responsibility for a system module, develops it, integrates it with the rest of the system and pushes it to production, then acquires an ALGO Miner.
An ALGO Miner is not a hardware piece, but a smart contract running on the Ethereum blockchain.
The miner, that is, the person mining, receives a daily stream of tokens.
Each ALGO Miner has a pre-defined capacity. There are 5 different categories, each with their own capacity, since not all modules have the same importance, difficulty or criticality.
For instance, a category 1 ALGO Miner distributes 1 million ALGO. The same goes all the way up to category 5 miners which distribute 5 million ALGO.
Where do these tokens come from?
For developers, from the System Development Pool distributing 15% of the total supply.
A similar mining scheme has been established for people working in business development. In this case, the responsibility does not lie in developing and maintaining a system module, but in developing and maintaining a business area.
So, for business people, miners obtain the tokens from the Business Development pool.
ALGO Miners feature a distribution schedule that mimics the halving events of bitcoin mining, except that instead of halving every 4 years, the halving occurs every year.
50% of an ALGO Miner capacity is distributed day by day during the first year.
The second year distributes half of what was distributed the first year, that is 25% of the total.
The third year, half of the previous year, and so on.
By the fifth year, over 98% of the capacity will have been distributed.
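As a back-of-the-envelope sketch of that schedule (the exact rounding and tail handling in the actual smart contract are not described here, so treat the numbers as illustrative):

```python
def yearly_fractions(years=6):
    """Fraction of a miner's capacity distributed each year, halving every year."""
    return [0.5 ** (year + 1) for year in range(years)]

capacity = 1_000_000  # e.g. a category 1 ALGO Miner
distributed = 0.0
for year, fraction in enumerate(yearly_fractions(), start=1):
    distributed += fraction
    per_day = capacity * fraction / 365
    print(f"year {year}: {fraction:.2%} of capacity, "
          f"~{per_day:,.0f} ALGO per day, {distributed:.2%} distributed so far")
```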
ALGO Miners also distribute ALGO collected in fees from financial beings running on the platform.
These fees are part of the expenses financial beings have to pay. We call it the Time to Live fee, meaning financial beings buy time to be kept alive.
The Time to Live fee is collected and distributed among active miners, on a daily basis.
This is the second source of ALGO tokens for miners. We expect this source of income to be minimal during the first few years, but with time and increased adoption, it should steadily grow, allowing miners to have a substantial income by the time the ALGO coming from distribution pools is exhausted.
These are the mechanisms we have engineered to incentivize the team in the long run. This set of incentives is also part of the evolutionary model, in this case enabling the system itself to evolve.
The Team
Beyond the Core Team, there is a wider Project Team building parts of the system. Team members are compensated through the mining of ALGO tokens, as described earlier.
To become a team member the applicant needs to take responsibility either over one piece of the software pending development, or over a business development area.
Examples of pieces of the puzzle being assembled by the Project Team are the Ethereum smart contracts needed to issue and distribute the ALGO tokens, the ALGO Shop module, the Logs module and the Financial Beings module.
Smart contracts are still undergoing testing, thus mining hasn’t started yet.
As soon as testing is finalized, the contracts will be deployed and mining will kick off.
It’s been a while since that crisp summer morning of 2017 that saw the Superalgos project start.
Lots of water has run down the Danube since.
The Core Team has laid down the foundations for a great project and we are ready to step up the game by growing the team that will develop and maintain the remaining system modules and business areas.
We encourage everyone willing to see our vision materialize to come forward, contribute some work and become a part of the team.
Come plug your brain in the largest-ever trading supermind!
Follow Superalgos on Twitter or Facebook; or visit us on Telegram or at our web site.
A bit about me: I am an entrepreneur who started his career a long time ago designing and building banking systems. After developing many interesting ideas through the years, in 2017 I started Superalgos. Finally, the project of a lifetime.
Superalgos, Part Two: Building a Trading Supermind was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.
[Telegram Channel | Original Article ]
0 notes
Text
Using Python to recover SEO site traffic (Part three) Search Engine Watch
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements, as my feature set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and more effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which is roughly proportional to the product of the width and height of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
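As a rough sketch of the two steps just described, one-hot encoding and binning image sizes into 50 buckets, with pandas (column names and sample values are assumptions, not the notebook’s actual ones):

```python
import pandas as pd

# img_counts: one row per image, with the page URL and the image file size in bytes.
img_counts = pd.DataFrame({
    "url": ["/product-1", "/product-1", "/category-1", "/category-1", "/category-1"],
    "size": [250_000, 180_000, 12_000, 9_500, 11_000],
})

# Bin the continuous file sizes into 50 buckets ("binning"), then one-hot encode
# the buckets so that image size is treated as categorical, not numerical.
img_counts["size_bin"] = pd.cut(img_counts["size"], bins=50)
one_hot = pd.get_dummies(img_counts["size_bin"], prefix="size_bin")

# Aggregate per URL: how many images on each page fall into each size bucket.
features = pd.concat([img_counts[["url"]], one_hot], axis=1).groupby("url").sum()
print(features.shape)  # one row per page, one column per size bucket
```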
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
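As a sketch of what such a split looks like with scikit-learn, on stand-in data (the article’s own code lives in the linked notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the real feature matrix (one row per URL) and page-group labels.
X = np.random.rand(100, 5)
y = np.array(["product"] * 40 + ["category"] * 40 + ["other"] * 20)

# Hold out 30% of the pages; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```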
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
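A cut-down sketch of that pattern, a few candidate models each with a small hyperparameter grid evaluated by GridSearchCV, might look like this (stand-in data; the notebook’s actual grid is larger):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Stand-in data; in the notebook X_train and y_train come from the split above.
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=42)

candidates = {
    "linear_svm": (LinearSVC(), {"C": [0.1, 1, 10]}),
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)  # feed in the X and Y variables, then .fit()
    results[name] = (search.best_score_, search.best_params_)

for name, (score, params) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: best CV accuracy {score:.3f} with {params}")
```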
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974), with Logistic Regression (0.968) coming in a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well labeling products, but terribly on pages that are neither products nor categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
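A minimal scikit-learn version of that step, on stand-in data rather than the article’s actual features and winning model, could be:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in data and model; in the notebook these are the engineered features
# and the winning estimator from the grid search.
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearSVC().fit(X_train, y_train)

# Rows are the true page groups, columns the predicted ones;
# the diagonal holds the correct predictions.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```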
Finally, here is the code to plot the model evaluation:
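And a sketch of the plotting step using scikit-learn’s built-in display (the matrix values below are placeholders, not the article’s results):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

# Placeholder matrix standing in for the one computed in the previous snippet.
cm = np.array([[34, 2, 1],
               [3, 30, 4],
               [2, 6, 18]])

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["product", "category", "other"])
disp.plot(cmap="Blues")
plt.title("Page-group classification on the test set")
plt.show()
```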
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a Pydata event. I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
0 notes
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements as my features set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which would be proportional to the multiplication of the width and the length of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974) and Logistic regression (0.968) coming at a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well-labeling products, but terribly labeling pages that are not product or categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
Finally, here is the code to plot the model evaluation:
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a Pydata event I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
from IM Tips And Tricks https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/ from Rising Phoenix SEO https://risingphxseo.tumblr.com/post/184297809275
0 notes
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements as my features set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which would be proportional to the multiplication of the width and the length of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974) and Logistic regression (0.968) coming at a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well-labeling products, but terribly labeling pages that are not product or categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
Finally, here is the code to plot the model evaluation:
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a Pydata event I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
source https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/ from Rising Phoenix SEO http://risingphoenixseo.blogspot.com/2019/04/using-python-to-recover-seo-site.html
0 notes
Link
via Screaming Frog
From crawl completion notifications to automated reporting: this post may not have Billy the Kid or Butch Cassidy, but it does have a few of my most useful tools to combine with the SEO Spider (just as exciting).
We SEOs are extremely lucky: not just because we’re working in such an engaging and collaborative industry, but because we have access to a plethora of online resources, conferences and SEO-based tools to lend a hand with almost any task you could think up.
My favourite of which is, of course, the SEO Spider – after all, following Minesweeper and Outlook, it’s likely the most used program on my work PC. However, even a great program can be made more useful when combined with a gang of other fantastic tools to enhance, complement or adapt its already vast and growing feature set.
While it isn’t quite the ragtag group from John Sturges’ 1960 cult classic, I’ve compiled the Magnificent Seven(ish) SEO tools I find useful to use in conjunction with the SEO Spider:
Debugging in Chrome Developer Tools
Chrome is the definitive king of browsers, and arguably one of the most installed programs on the planet. What’s more, it’s got a full suite of free developer tools built straight in—to load it up, just right-click on any page and hit inspect. Among many aspects, this is particularly handy to confirm or debunk what might be happening in your crawl versus what you see in a browser.
For instance, while the Spider does check response headers during a crawl, maybe you just want to dig a bit deeper and view it as a whole? Well, just go to the Network tab, select a request and open the Headers sub-tab for all the juicy details:
Perhaps you’ve loaded a crawl that’s only returning one or two results and you think JavaScript might be the issue? Well, just hit the three dots (highlighted above) in the top right corner, then click settings > debugger > disable JavaScript and refresh your page to see how it looks:
Or maybe you just want to compare your nice browser-rendered HTML to that served back to the Spider? Just open the Spider and enable ‘JavaScript Rendering’ & ‘Store Rendered HTML’ in the configuration options (Configuration > Spider > Rendering/Advanced), then run your crawl. Once complete, you can view the rendered HTML in the bottom ‘View Source’ tab and compare with the rendered HTML in the ‘elements’ tab of Chrome.
There are honestly far too many options in the Chrome developer toolset to list here, but it’s certainly worth getting your head around.
Page Validation with a Right-Click
Okay, I’m cheating a bit here as this isn’t one tool, rather a collection of several, but have you ever tried right-clicking a URL within the Spider? Well, if not, I’d recommend giving it a go—on top of some handy exports like the crawl path report and visualisations, there’s a ton of options to open that URL into several individual analysis & validation apps:
Google Cache – See how Google is caching and storing your pages’ HTML.
Wayback Machine – Compare URL changes over time.
Other Domains on IP – See all domains registered to that IP Address.
Open Robots.txt – Look at a site’s Robots.
HTML Validation with W3C – Double-check all HTML is valid.
PageSpeed Insights – Any areas to improve site speed?
Structured Data Tester – Check all on-page structured data.
Mobile-Friendly Tester – Are your pages mobile-friendly?
Rich Results Tester – Is the page eligible for rich results?
AMP Validator – Official AMP project validation test.
User Data and Link Metrics via API Access
We SEOs can’t get enough data; it’s genuinely all we crave. Whether that’s from user testing, keyword tracking or session information, we want it all and we want it now! After all, creating the perfect website for bots is one thing, but ultimately the aim of almost every site is to get more users to view and convert on the domain, so we need to view it from as many angles as possible.
Starting with users, there’s practically no better insight into user behaviour than the raw data provided by both Google Search Console (GSC) and Google Analytics (GA), both of which help us make informed, data-driven decisions and recommendations.
What’s great about this is you can easily integrate any GA or GSC data straight into your crawl via the API Access menu so it’s front and centre when reviewing any changes to your pages. Just head on over to Configuration > API Access > [your service of choice], connect to your account, configure your settings and you’re good to go.
Another crucial area in SERP rankings is the perceived authority of each page in the eyes of search engines – a major aspect of which is (of course) links, links and more links. Any SEO will know you can’t spend more than 5 minutes at BrightonSEO before someone brings up the subject of links; it’s like the lifeblood of our industry. Whether their importance is dying out or not, there’s no denying that they currently still hold much value within our perceptions of Google’s algorithm.
Well, alongside the previous user data you can also use the API Access menu to connect with some of the biggest tools in the industry such as Moz, Ahrefs or Majestic, to analyse your backlink profile for every URL pulled in a crawl.
For all the gory details on API Access check out the following page (scroll down for other connections): https://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/#google-analytics-integration
Understanding Bot Behaviour with the Log File Analyzer
An often-overlooked exercise, log file analysis gives us an insight into how bots are interacting with a site that nothing else quite matches, straight from the server logs. The trouble is, these files can be messy and hard to analyse on their own, which is where our very own Log File Analyzer (LFA) comes into play (they didn’t force me to add this one in, promise!).
I’ll leave @ScreamingFrog to go into all the gritty details on why this tool is so useful, but my personal favourite aspect is the ‘Import URL data’ tab on the far right. This little gem will effectively match any spreadsheet containing URL information with the bot data on those URLs.
So, you can run a crawl in the Spider while connected to GA, GSC and a backlink app of your choice, pulling the respective data from each URL alongside the original crawl information. Then, export this into a spreadsheet before importing into the LFA to get a report combining metadata, session data, backlink data and bot data all in one comprehensive summary, aka the holy quadrilogy of technical SEO statistics.
While the LFA is a paid tool, there’s a free version if you want to give it a go.
Crawl Reporting in Google Data Studio
One of my favourite reports from the Spider is the simple but useful ‘Crawl Overview’ export (Reports > Crawl Overview), and if you mix this with the scheduling feature, you’re able to create a simple crawl report every day, week, month or year. This allows you to monitor for any drastic changes to the domain, alerting you to anything which might be cause for concern between crawls.
However, in its native form it’s not the easiest to compare between dates, which is where Google Sheets & Data Studio can come in to lend a hand. After a bit of setup, you can easily copy over the crawl overview into your master G-Sheet each time your scheduled crawl completes, then Data Studio will automatically update, letting you spend more time analysing changes and less time searching for them.
This will require some fiddling to set up; however, at the end of this section I’ve included links to an example G-Sheet and Data Studio report that you’re welcome to copy. Essentially, you need a G-Sheet with date entries in one column and unique headings from the crawl overview report (or another) in the remaining columns:
Once that’s sorted, take your crawl overview report and copy out all the data in the ‘Number of URI’ column (column B), being sure to copy from the ‘Total URI Encountered’ until the end of the column.
Open your master G-Sheet and create a new date entry in column A (add this in a format of YYYYMMDD). Then in the adjacent cell, Right-click > ‘Paste special’ > ‘Paste transposed’ (Data Studio prefers this to long-form data):
If done correctly with several entries of data, you should have something like this:
Once the data is in a G-Sheet, uploading this to Data Studio is simple, just create a new report > add data source > connect to G-Sheets > [your master sheet] > [sheet page] and make sure all the heading entries are set as a metric (blue) while the date is set as a dimension (green), like this:
You can then build out a report to display your crawl data in whatever format you like. This can include scorecards and tables for individual time periods, or trend graphs to compare crawl stats over the date range provided, (you’re very own Search Console Coverage report).
Here’s an overview report I quickly put together as an example. You can obviously do something much more comprehensive than this should you wish, or perhaps take this concept and combine it with even more reports and exports from the Spider.
If you’d like a copy of both my G-Sheet and Data Studio report, feel free to take them from here: Master Crawl Overview G-Sheet: https://docs.google.com/spreadsheets/d/1FnfN8VxlWrCYuo2gcSj0qJoOSbIfj7bT9ZJgr2pQcs4/edit?usp=sharing Crawl Overview Data Studio Report: https://datastudio.google.com/open/1Luv7dBnkqyRj11vLEb9lwI8LfAd0b9Bm
Note: if you take a copy, some of the dimension formats may change within Data Studio, breaking the graphs, so it’s worth checking that the date is still set to ‘Date (YYYYMMDD)’.
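If you’d rather script that copy-and-transpose step than paste it by hand each time, something along these lines could work. This isn’t from the original post: it’s a rough sketch that assumes a gspread service-account JSON file and that the exported crawl_overview.csv keeps the ‘Total URI Encountered’ row and ‘Number of URI’ column described above.

```python
# Rough sketch (assumptions noted above): append one dated, transposed row from a
# Crawl Overview export to a master Google Sheet feeding the Data Studio report.
import csv
from datetime import date

import gspread  # pip install gspread

def append_crawl_overview(csv_path, sheet_name):
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    # Grab column B ("Number of URI") from the "Total URI Encountered" row onwards.
    start = next(i for i, row in enumerate(rows) if row and row[0].startswith("Total URI Encountered"))
    counts = [row[1] for row in rows[start:] if len(row) > 1]
    gc = gspread.service_account(filename="service_account.json")  # hypothetical credentials file
    worksheet = gc.open(sheet_name).sheet1
    worksheet.append_row([date.today().strftime("%Y%m%d")] + counts)  # date in YYYYMMDD, then the counts

append_crawl_overview("crawl_overview.csv", "Master Crawl Overview")
```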
Building Functions & Strings with XPath Helper & Regex Search
The Spider is capable of doing some very cool stuff with the extraction feature, a lot of which is listed in our guide to web scraping and extraction. The trouble with much of this is it will require you to build your own XPath or regex string to lift your intended information.
While simply right-clicking > Copy XPath within the inspect window will usually do enough to scrape, it’s not always going to cut it for some types of data. This is where two Chrome extensions, XPath Helper & Regex Search, come in useful.
Unfortunately, these won’t automatically build any strings or functions for you, but if you combine them with a cheat sheet and some trial and error, you can easily build one out in Chrome before copying it into the Spider to run in bulk across all your pages.
For example, say I wanted to get all the dates and author information of every article on our blog subfolder (https://www.screamingfrog.co.uk/blog/).
If you simply right clicked on one of the highlighted elements in the inspect window and hit Copy > Copy XPath, you would be given something like: /html/body/div[4]/div/div[1]/div/div[1]/div/div[1]/p
While this does the trick, it will only pull the single instance copied (‘16 January, 2019 by Ben Fuller’). Instead, we want all the dates and authors from the /blog subfolder.
By looking at what elements the reference is sitting in we can slowly build out an XPath function directly in XPath Helper and see what it highlights in Chrome. For instance, we can see it sits in a class of ‘main-blog–posts_single-inner–text–inner clearfix’, so pop that as a function into XPath Helper: //div[@class="main-blog--posts_single-inner--text--inner clearfix"]
XPath Helper will then highlight the matching results in Chrome:
Close, but this is also pulling the post titles, so not quite what we’re after. It looks like the date and author names are sitting in a sub <p> tag so let’s add that into our function: (//div[@class="main-blog--posts_single-inner--text--inner clearfix"])/p
Bingo! Stick that in the custom extraction feature of the Spider (Configuration > Custom > Extraction), upload your list of pages, and watch the results pour in!
Regex Search works much in the same way: simply start writing your string, hit next, and you can visually see what it’s matching as you go. Once you’ve got it, whack it in the Spider, upload your URLs, then sit back and relax.
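Before committing either type of string to a scheduled crawl, it can be worth sanity-checking it against a single page outside the browser. The snippet below isn’t from the original post; it’s a rough sketch using requests and lxml that reuses the blog XPath built above (the class names and date format may well have changed since this was written).

```python
# Rough sketch: test an XPath and a regex against one page before using them
# in the Spider's custom extraction (Configuration > Custom > Extraction).
import re

import requests
from lxml import html

response = requests.get("https://www.screamingfrog.co.uk/blog/", timeout=10)
tree = html.fromstring(response.content)

# The XPath built above for the date/author line of each blog post.
xpath = '(//div[@class="main-blog--posts_single-inner--text--inner clearfix"])/p'
for node in tree.xpath(xpath):
    print(node.text_content().strip())

# An example regex for "16 January, 2019 by Ben Fuller"-style strings in the raw HTML.
pattern = r"\d{1,2} \w+, \d{4} by [A-Z]\w+ [A-Z]\w+"
for match in re.findall(pattern, response.text):
    print(match)
```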
Notifications & Auto Mailing Exports with Zapier
Zapier brings together all kinds of web apps, letting them communicate and work with one another when they might not otherwise be able to. It works by having an action in one app set as a trigger and another app set to perform an action as a result.
To make things even better, it works natively with a ton of applications such as G-Suite, Dropbox, Slack, and Trello. Unfortunately, as the Spider is a desktop app, we can’t directly connect it with Zapier. However, with a bit of tinkering, we can still make use of its functionality to provide email notifications or auto mailing reports/exports to yourself and a list of predetermined contacts whenever a scheduled crawl completes.
All you need is to have your machine or server set up with an auto cloud sync directory such as those on ‘Dropbox’, ‘OneDrive’ or ‘Google Backup & Sync’. Inside this directory, create a folder to save all your crawl exports & reports. In this instance, I’m using G-drive, but others should work just as well.
You’ll need to set a scheduled crawl in the Spider (File > Schedule) to export any tabs, bulk exports or reports into a timestamped folder within this auto-synced directory:
Log into or create an account for Zapier and make a new ‘zap’ to email yourself or a list of contacts whenever a new folder is generated within the synced directory you selected in the previous step. You’ll have to provide Zapier access to both your G-Drive & Gmail for this to work (do so at your own risk).
My zap looks something like this:
The above Zap will trigger when a new folder is added to /Scheduled Crawls/ in my G-Drive account. It will then send out an email from my Gmail to myself and any other contacts, notifying them and attaching a direct link to the newly added folder and Spider exports.
I’d like to note here that if running a large crawl or directly saving the crawl file to G-drive, you’ll need enough storage to upload (so I’d stick to exports). You’ll also have to wait until the sync is completed from your desktop to the cloud before the zap will trigger, and it checks this action on a cycle of 15 minutes, so might not be instantaneous.
Alternatively, do the same thing on IFTTT (If This Then That) but set it so a new G-drive file will ping your phone, turn your smart light a hue of lime green or just play this sound at full volume on your smart speaker. We really are living in the future now!
Conclusion
There you have it, the Magnificent Seven(ish) tools to try using with the SEO Spider, combined to form the deadliest gang in the west web. Hopefully, you find some of these useful, but I’d love to hear if you have any other suggestions to add to the list.
The post SEO Spider Companion Tools, Aka ‘The Magnificent Seven’ appeared first on Screaming Frog.
0 notes
Link
via Screaming Frog
From crawl completion notifications to automated reporting: this post may not have Billy the Kid or Butch Cassidy, instead, here are a few of my most useful tools to combine with the SEO Spider, (just as exciting).
We SEOs are extremely lucky—not just because we’re working in such an engaging and collaborative industry, but we have access to a plethora of online resources, conferences and SEO-based tools to lend a hand with almost any task you could think up.
My favourite of which is, of course, the SEO Spider—after all, following Minesweeper Outlook, it’s likely the most used program on my work PC. However, a great programme can only be made even more useful when combined with a gang of other fantastic tools to enhance, compliment or adapt the already vast and growing feature set.
While it isn’t quite the ragtag group from John Sturges’ 1960 cult classic, I’ve compiled the Magnificent Seven(ish) SEO tools I find useful to use in conjunction with the SEO Spider:
Debugging in Chrome Developer Tools
Chrome is the definitive king of browsers, and arguably one of the most installed programs on the planet. What’s more, it’s got a full suite of free developer tools built straight in—to load it up, just right-click on any page and hit inspect. Among many aspects, this is particularly handy to confirm or debunk what might be happening in your crawl versus what you see in a browser.
For instance, while the Spider does check response headers during a crawl, maybe you just want to dig a bit deeper and view it as a whole? Well, just go to the Network tab, select a request and open the Headers sub-tab for all the juicy details:
Perhaps you’ve loaded a crawl that’s only returning one or two results and you think JavaScript might be the issue? Well, just hit the three dots (highlighted above) in the top right corner, then click settings > debugger > disable JavaScript and refresh your page to see how it looks:
Or maybe you just want to compare your nice browser-rendered HTML to that served back to the Spider? Just open the Spider and enable ‘JavaScript Rendering’ & ‘Store Rendered HTML’ in the configuration options (Configuration > Spider > Rendering/Advanced), then run your crawl. Once complete, you can view the rendered HTML in the bottom ‘View Source’ tab and compare with the rendered HTML in the ‘elements’ tab of Chrome.
There are honestly far too many options in the Chrome developer toolset to list here, but it’s certainly worth getting your head around.
Page Validation with a Right-Click
Okay, I’m cheating a bit here as this isn’t one tool, rather a collection of several, but have you ever tried right-clicking a URL within the Spider? Well, if not, I’d recommend giving it a go—on top of some handy exports like the crawl path report and visualisations, there’s a ton of options to open that URL into several individual analysis & validation apps:
Google Cache – See how Google is caching and storing your pages’ HTML.
Wayback Machine – Compare URL changes over time.
Other Domains on IP – See all domains registered to that IP Address.
Open Robots.txt – Look at a site’s Robots.
HTML Validation with W3C – Double-check all HTML is valid.
PageSpeed Insights – Any areas to improve site speed?
Structured Data Tester – Check all on-page structured data.
Mobile-Friendly Tester – Are your pages mobile-friendly?
Rich Results Tester – Is the page eligible for rich results?
AMP Validator – Official AMP project validation test.
User Data and Link Metrics via API Access
We SEOs can’t get enough data, it’s genuinely all we crave – whether that’s from user testing, keyword tracking or session information, we want it all and we want it now! After all, creating the perfect website for bots is one thing, but ultimately the aim of almost every site is to get more users to view and convert on the domain, so we need to view it from as many angles as possible.
Starting with users, there’s practically no better insight into user behaviour than the raw data provided by both Google Search Console (GSC) and Google Analytics (GA), both of which help us make informed, data-driven decisions and recommendations.
What’s great about this is you can easily integrate any GA or GSC data straight into your crawl via the API Access menu so it’s front and centre when reviewing any changes to your pages. Just head on over to Configuration > API Access > [your service of choice], connect to your account, configure your settings and you’re good to go.
Another crucial area in SERP rankings is the perceived authority of each page in the eyes of search engines – a major aspect of which, is (of course), links., links and more links. Any SEO will know you can’t spend more than 5 minutes at BrightonSEO before someone brings up the subject of links, it’s like the lifeblood of our industry. Whether their importance is dying out or not there’s no denying that they currently still hold much value within our perceptions of Google’s algorithm.
Well, alongside the previous user data, you can also use the API Access menu to connect with some of the biggest tools in the industry, such as Moz, Ahrefs or Majestic, to analyse your backlink profile for every URL pulled in a crawl.
For all the gory details on API Access check out the following page (scroll down for other connections): https://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/#google-analytics-integration
Understanding Bot Behaviour with the Log File Analyzer
An often-overlooked exercise, log file analysis gives us an insight into how bots are interacting with a site that nothing else quite matches, as it comes straight from the server logs. The trouble is, these files can be messy and hard to analyse on their own, which is where our very own Log File Analyzer (LFA) comes into play (they didn’t force me to add this one in, promise!).
I’ll leave @ScreamingFrog to go into all the gritty details on why this tool is so useful, but my personal favourite aspect is the ‘Import URL data’ tab on the far right. This little gem will effectively match any spreadsheet containing URL information with the bot data on those URLs.
So, you can run a crawl in the Spider while connected to GA, GSC and a backlink app of your choice, pulling the respective data from each URL alongside the original crawl information. Then, export this into a spreadsheet before importing into the LFA to get a report combining metadata, session data, backlink data and bot data all in one comprehensive summary, aka the holy quadrilogy of technical SEO statistics.
While the LFA is a paid tool, there’s a free version if you want to give it a go.
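Before diving into the LFA, if you just want a quick feel for what’s sitting in a raw access log, a few lines of Python can tally up requests from (self-declared) Googlebot. This is only a rough sketch: it assumes an Apache-style combined log format and a file named access.log, and genuine Googlebot traffic should really be verified via reverse DNS:

import re
from collections import Counter

# Loose pattern for a combined-format log line: request path, status code and user agent
line_pattern = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

hits = Counter()
with open("access.log", encoding="utf-8", errors="ignore") as log_file:
    for line in log_file:
        match = line_pattern.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1

# Top 20 URLs by (claimed) Googlebot requests
for path, count in hits.most_common(20):
    print(count, path)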
Crawl Reporting in Google Data Studio
One of my favourite reports from the Spider is the simple but useful ‘Crawl Overview’ export (Reports > Crawl Overview), and if you mix this with the scheduling feature, you’re able to create a simple crawl report every day, week, month or year. This lets you monitor the domain for any drastic changes and alerts you to anything that might be cause for concern between crawls.
However, in its native form it’s not the easiest to compare between dates, which is where Google Sheets & Data Studio can come in to lend a hand. After a bit of setup, you can easily copy over the crawl overview into your master G-Sheet each time your scheduled crawl completes, then Data Studio will automatically update, letting you spend more time analysing changes and less time searching for them.
This will require some fiddling to set up; however, at the end of this section I’ve included links to an example G-Sheet and Data Studio report that you’re welcome to copy. Essentially, you need a G-Sheet with date entries in one column and unique headings from the crawl overview report (or another) in the remaining columns:
Once that’s sorted, take your crawl overview report and copy out all the data in the ‘Number of URI’ column (column B), being sure to copy from the ‘Total URI Encountered’ row to the end of the column.
Open your master G-Sheet and create a new date entry in column A (add this in a format of YYYYMMDD). Then in the adjacent cell, Right-click > ‘Paste special’ > ‘Paste transposed’ (Data Studio prefers this to long-form data):
If done correctly with several entries of data, you should have something like this:
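If you’d rather script that copy-and-transpose step, here’s a minimal sketch using only Python’s standard library. It assumes the Crawl Overview export is saved as crawl_overview.csv with the metric labels in column A and the ‘Number of URI’ figures in column B (as described above); the output is a single dated row ready to append to the master G-Sheet:

import csv
from datetime import date

# Placeholder file name for the Spider's Crawl Overview export
with open("crawl_overview.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

# Keep only the metric rows, starting at 'Total URI Encountered' (label assumed from the steps above)
start = next(i for i, row in enumerate(rows) if row and row[0] == "Total URI Encountered")
values = [row[1] for row in rows[start:] if len(row) > 1]

# One transposed line: date first (YYYYMMDD), then each metric in its own column
out_row = [date.today().strftime("%Y%m%d")] + values

with open("master_row.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(out_row)

print("Row ready to append to the master G-Sheet:", out_row[:5], "...")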
Once the data is in a G-Sheet, uploading this to Data Studio is simple, just create a new report > add data source > connect to G-Sheets > [your master sheet] > [sheet page] and make sure all the heading entries are set as a metric (blue) while the date is set as a dimension (green), like this:
You can then build out a report to display your crawl data in whatever format you like. This can include scorecards and tables for individual time periods, or trend graphs to compare crawl stats over the date range provided (your very own Search Console Coverage report).
Here’s an overview report I quickly put together as an example. You can obviously do something much more comprehensive than this should you wish, or perhaps take this concept and combine it with even more reports and exports from the Spider.
If you’d like a copy of both my G-Sheet and Data Studio report, feel free to take them from here:
Master Crawl Overview G-Sheet: https://docs.google.com/spreadsheets/d/1FnfN8VxlWrCYuo2gcSj0qJoOSbIfj7bT9ZJgr2pQcs4/edit?usp=sharing
Crawl Overview Data Studio Report: https://datastudio.google.com/open/1Luv7dBnkqyRj11vLEb9lwI8LfAd0b9Bm
Building Functions & Strings with XPath Helper & Regex Search
The Spider is capable of doing some very cool stuff with the extraction feature, a lot of which is listed in our guide to web scraping and extraction. The trouble with much of this is that it requires you to build your own XPath or regex string to lift the information you’re after.
While simply right-clicking > Copy XPath within the inspect window will usually do enough to scrape, it’s not always going to cut it for some types of data. This is where two Chrome extensions, XPath Helper & Regex Search, come in useful.
Unfortunately, these won’t automatically build any strings or functions for you, but if you combine them with a cheat sheet and some trial and error, you can easily build one out in Chrome before copying it into the Spider to run in bulk across all your pages.
For example, say I wanted to get all the dates and author information of every article on our blog subfolder (https://www.screamingfrog.co.uk/blog/).
If you simply right-clicked on one of the highlighted elements in the inspect window and hit Copy > Copy XPath, you would be given something like: /html/body/div[4]/div/div[1]/div/div[1]/div/div[1]/p
While this does the trick, it will only pull the single instance copied (‘16 January, 2019 by Ben Fuller’). Instead, we want all the dates and authors from the /blog subfolder.
By looking at what elements the reference is sitting in we can slowly build out an XPath function directly in XPath Helper and see what it highlights in Chrome. For instance, we can see it sits in a class of ‘main-blog–posts_single-inner–text–inner clearfix’, so pop that as a function into XPath Helper: //div[@class="main-blog--posts_single-inner--text--inner clearfix"]
XPath Helper will then highlight the matching results in Chrome:
Close, but this is also pulling the post titles, so not quite what we’re after. It looks like the date and author names are sitting in a sub <p> tag so let’s add that into our function: (//div[@class="main-blog--posts_single-inner--text--inner clearfix"])/p
Bingo! Stick that in the custom extraction feature of the Spider (Configuration > Custom > Extraction), upload your list of pages, and watch the results pour in!
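If you’d like to sanity-check an XPath away from the browser before running a full crawl, a quick sketch with requests and lxml (both assumed to be installed) will show what it matches against the raw HTML. Bear in mind this doesn’t render JavaScript, and the class names on the live blog may well have changed since this post was written:

import requests
from lxml import html  # assumes the 'lxml' package is installed

# The blog URL and XPath from the example above
url = "https://www.screamingfrog.co.uk/blog/"
xpath = '(//div[@class="main-blog--posts_single-inner--text--inner clearfix"])/p'

tree = html.fromstring(requests.get(url).text)

# Print the text of each matching <p> tag (expected: date and author lines)
for element in tree.xpath(xpath):
    print(element.text_content().strip())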
Regex Search works much in the same way: simply start writing your string, hit next and you can visually see what it’s matching as you go. Once you’ve got it, whack it in the Spider, upload your URLs, then sit back and relax.
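The same kind of offline check works for regex strings too. The pattern below is purely illustrative, aimed at strings like ‘16 January, 2019 by Ben Fuller’, and would need adjusting to the real markup:

import re
import requests

url = "https://www.screamingfrog.co.uk/blog/"  # same example page as above
html_text = requests.get(url).text

# Illustrative pattern for date/author strings such as '16 January, 2019 by Ben Fuller'
pattern = re.compile(r"\d{1,2} [A-Z][a-z]+, \d{4} by [A-Z][a-z]+ [A-Z][a-z]+")

for match in pattern.findall(html_text):
    print(match)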
Notifications & Auto Mailing Exports with Zapier
Zapier brings together all kinds of web apps, letting them communicate and work with one another when they might not otherwise be able to. It works by treating an event in one app as a trigger, which then prompts another app to perform an action as a result.
To make things even better, it works natively with a ton of applications such as G-Suite, Dropbox, Slack and Trello. Unfortunately, as the Spider is a desktop app, we can’t directly connect it with Zapier. However, with a bit of tinkering, we can still make use of its functionality to send email notifications or auto-mail reports/exports to yourself and a list of predetermined contacts whenever a scheduled crawl completes.
All you need is to have your machine or server set up with an auto cloud sync directory such as those on ‘Dropbox’, ‘OneDrive’ or ‘Google Backup & Sync’. Inside this directory, create a folder to save all your crawl exports & reports. In this instance, I’m using G-drive, but others should work just as well.
You’ll need to set a scheduled crawl in the Spider (File > Scheduling) to export any tabs, bulk exports or reports into a timestamped folder within this auto-synced directory:
Log into or create an account for Zapier and make a new ‘zap’ to email yourself or a list of contacts whenever a new folder is generated within the synced directory you selected in the previous step. You’ll have to provide Zapier access to both your G-Drive & Gmail for this to work (do so at your own risk).
My zap looks something like this:
The above Zap will trigger when a new folder is added to /Scheduled Crawls/ in my G-Drive account. It will then send out an email from my Gmail to myself and any other contacts, notifying them and attaching a direct link to the newly added folder and Spider exports.
I’d like to note here that if you’re running a large crawl or directly saving the crawl file to G-drive, you’ll need enough storage for the upload (so I’d stick to exports). You’ll also have to wait until the sync from your desktop to the cloud has completed before the zap will trigger, and Zapier checks for this on a 15-minute cycle, so it might not be instantaneous.
Alternatively, do the same thing on IFTTT (If This Then That) but set it so a new G-drive file will ping your phone, turn your smart light a hue of lime green or just play this sound at full volume on your smart speaker. We really are living in the future now!
Conclusion
There you have it, the Magnificent Seven(ish) tools to try using with the SEO Spider, combined to form the deadliest gang in the west (well, the web). Hopefully, you find some of these useful, but I’d love to hear if you have any other suggestions to add to the list.
The post SEO Spider Companion Tools, Aka ‘The Magnificent Seven’ appeared first on Screaming Frog.