#matrixnet
squaredaisy · 3 days
Video
vimeo
Business Broadcast - What is MatrixNET? from Square Daisy on Vimeo.
0 notes
russianseo · 7 years
Text
Yandex released a new machine learning library in public access
Tumblr media
Yandex has developed a new machine learning method called CatBoost. It efficiently trains models on heterogeneous data, such as user location, operation history and device type. The CatBoost machine learning library has been released as open source and can be used by anyone. To work with CatBoost, it is enough to install it on your computer. The library supports Linux, Windows and macOS and is available for the Python and R programming languages. Yandex has also developed CatBoost Viewer, a visualization program that lets you monitor the training process on charts. Both CatBoost and CatBoost Viewer can be downloaded from GitHub.
CatBoost is the successor to the Matrixnet machine learning method, which is used in almost all Yandex services. Like Matrixnet, CatBoost employs gradient boosting, which is well suited to heterogeneous data. But whereas Matrixnet trains models on numerical data only, CatBoost also handles non-numerical data, for example cloud types or building types. Previously, such data had to be converted into numbers, which could distort its meaning and hurt the accuracy of the model. Now it can be used in its original form. Thanks to this, CatBoost delivers higher model quality than comparable methods for working with heterogeneous data. It can be used in a wide range of areas, from banking to industrial applications.
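As a rough illustration of why categorical features need special care, here is a plain-Python mean target encoding, the kind of category statistic boosting libraries compute internally. This is a simplified sketch, not CatBoost's actual algorithm (CatBoost additionally uses an ordering scheme to avoid target leakage), and all names and data below are invented:

```python
from collections import defaultdict

def mean_target_encode(categories, targets):
    """Replace each categorical value with the mean target of the rows
    sharing that value. This is a simplified version of the category
    statistics that gradient boosting libraries such as CatBoost compute
    internally; CatBoost's real method also orders the rows to avoid
    leaking the target into the encoding."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

# "cloud type" as a categorical feature, target = whether it rained
clouds = ["cumulus", "nimbus", "cumulus", "nimbus", "cirrus"]
rained = [0, 1, 0, 1, 0]
print(mean_target_encode(clouds, rained))  # → [0.0, 1.0, 0.0, 1.0, 0.0]
```

In the real library you would not do this by hand: you would pass the raw categorical columns to the model and list them in the `cat_features` argument of `fit`.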
Mikhail Bilenko, head of Yandex's machine intelligence and research department:
"Yandex has been doing machine learning for many years, and CatBoost was created by the best specialists in this field. By releasing the CatBoost library as open source, we want to contribute to the development of machine learning. I should note that CatBoost is the first Russian machine learning method to become available in open source. We hope the expert community will appreciate it and help make it even better."
The new method has already been tested on Yandex services. As part of the experiment, it was used to improve search results, to rank the Yandex.Zen recommendation feed and to calculate weather forecasts in the Meteum technology, and in all cases it proved better than Matrixnet. In the future, CatBoost will power other services as well. It is also used by the Yandex Data Factory team in its industrial solutions, in particular for optimizing raw material consumption and predicting defects.
In addition, CatBoost has been adopted by the European Organization for Nuclear Research (CERN), which uses it to combine data from different parts of the LHCb detector.
0 notes
motozhituha · 7 years
Photo
Tumblr media
Grandma is cooling down, it seems. #бабушка #яндекс #matrixnet #meteum #yandex #яндекспогода #гробгробкладбищепидор
0 notes
domvolshebnika · 5 years
Link
0 notes
squaredaisy · 12 days
Video
BB-What is MatrixNET from Square Daisy on Vimeo.
0 notes
russianseo · 7 years
Text
Machine learning in Yandex search, or how Matrixnet works
Tumblr media
A user comes to the search engine and submits a query; the search engine's task is to return the most relevant documents for that query at the top. The documents that could match a given query are very numerous: there are billions of them in the index, and even after filtering, millions remain. And all those millions need to be put in the right order. To build the ranking formula, machine learning is used, namely Matrixnet, Yandex's own gradient boosting algorithm.
Matrixnet is gradient boosting over decision trees, and it supports all the main modes: classification, multi-class classification, regression, ranking and so on. There are also more complex modes that combine the ones above. Our department develops new modes for the needs of other departments, and internal Yandex users can add their own as well.
Matrixnet can also handle missing values: if the value of some factor is not specified, that is not a problem. Besides that, Matrixnet training can run on a cluster, since it is a distributed algorithm. This matters because the training sets used in search are too large to fit into the memory of a single server, which is why distributed training is essential.
Applying Matrixnet within Yandex
Matrixnet is used throughout Yandex. First of all, in search: Matrixnet was originally written for the search engine. Second, in advertising, to show users the ads most interesting to them by predicting ad clicks. Third, the Yandex weather forecast is built on a Matrixnet formula. The algorithm is also used in other Yandex projects: Yandex Data Factory (YDF), the Yandex.Zen recommendation system, bot detection, homonymy resolution, user segmentation and so on.
What makes Matrixnet different
Several gradient boosting algorithms are publicly available today, so let me explain how Matrixnet differs. One important property is that you barely need to tune its parameters. Why? When Matrixnet was written, it was tested against a number of different training sets (pools) and tuned to give good quality on all of them, which is why it also gives good quality on new data sets. Matrixnet is easy to use not only because it requires almost no parameter tuning, but also because Yandex has infrastructure that lets you launch training literally in one click (more on that below). In regression mode, Matrixnet beats all the other gradient boosting algorithms over decision trees on quality.
Matrixnet training is also heavily optimized. This matters for all Yandex tasks, but above all for search: although the training sets are large, we cannot let a formula train for a month, since overall quality would suffer in the end. That is why various optimizations are applied: algorithmic ones, low-level ones, and ones that reduce network load. Applying a Matrixnet formula is heavily optimized as well: in one second, a single thread can apply the formula to 100,000 documents.
Gradient boosting on decision trees
Decision trees are a data structure: a binary tree in which every non-leaf node holds a split on some factor against a threshold, and the leaf nodes hold numeric values. To apply such a tree to a document, you follow the splits from the root down to a leaf and take the value stored in that leaf.
Gradient boosting is a sum of simple models (decision trees in this case), each of which improves on the result of the previous combination.
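As a toy illustration of that sum of simple models, here is a minimal gradient boosting loop for squared error in plain Python: each round fits a depth-1 stump to the residuals of the current ensemble. This is a generic sketch under my own assumptions, not Matrixnet's implementation, and all names and data are invented:

```python
def fit_stump(xs, ys):
    """Fit a depth-1 'tree' on 1-D data: pick the threshold that best
    splits xs, predicting the mean of ys on each side (squared error)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20):
    """Gradient boosting for squared error: each round fits a stump to
    the residuals of the current ensemble and adds it to the sum."""
    models = []
    predict = lambda x: sum(m(x) for m in models)
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        models.append(fit_stump(xs, residuals))
    return predict

predict = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
# predict(1) ~ 1.0, predict(4) ~ 3.0
```

For squared error the residuals coincide with the negative gradient of the loss, which is where the name "gradient boosting" comes from.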
Matrixnet uses not arbitrary decision trees but so-called oblivious decision trees, in which every level splits on a single feature and a single threshold. Building trees this way has several consequences:
it produces very simple models that are resistant to overfitting
it partitions the space with hyperplanes: to find a document's leaf you compute all the splits, so the order in which you evaluate them does not matter
regularization: you need to guarantee there are no leaves containing almost no objects, so various regularizations are introduced to penalize such situations
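A minimal sketch of applying such an oblivious tree to a document: since every level tests one (feature, threshold) pair, the leaf index is just the bit pattern of all the comparisons, and the order of evaluation does not matter. The feature indices, thresholds and leaf values below are made up for illustration:

```python
def oblivious_tree_predict(features, splits, leaf_values):
    """features: list of floats for one document;
    splits: list of (feature_index, threshold), one per tree level;
    leaf_values: list of 2**len(splits) floats, one per leaf."""
    leaf = 0
    for feature_index, threshold in splits:
        # each comparison contributes one bit of the leaf index; the
        # comparisons are independent, which is what makes oblivious
        # trees easy to vectorize and distribute
        leaf = (leaf << 1) | (features[feature_index] > threshold)
    return leaf_values[leaf]

splits = [(0, 0.5), (2, 10.0)]        # a depth-2 tree with 4 leaves
leaf_values = [0.1, 0.4, 0.7, 0.9]
print(oblivious_tree_predict([0.8, 3.0, 12.0], splits, leaf_values))  # → 0.9
```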
Cluster learning
There are several ways to parallelize gradient boosting over decision trees across several servers:
by features
by documents
If we parallelize training by features (with different features stored on different servers), the amount of data that has to be sent over the network is proportional to the number of documents. Since the number of documents is huge and constantly growing, we cannot afford this, so we parallelize training by documents.
The bottleneck in training any gradient boosting over decision trees is selecting the tree structure, that is, the set of features that will build the next tree. The selection is usually done in one of two ways:
master-slave mode: there is a single leading node and a set of slaves, each of which calculates statistics for some of the features and sends them to the master, which aggregates them and picks the best feature

all-reduce mode: there is no dedicated master; every node calculates all the statistics and aggregates them itself
Each of those approaches has serious downsides. In master-slave mode, the master becomes the network bottleneck; in all-reduce mode there is too much traffic, since every node has to receive a lot of data. XGBoost, for example, works in all-reduce mode, so it does not parallelize as well. Matrixnet solves both problems as follows: when the next tree is being chosen, each feature is assigned a random node, which is declared its virtual master, and all the other slaves communicate with that node about that feature. The virtual master aggregates the necessary information, computes the statistics for its feature and sends the result to the master.
We also try to minimize traffic in several ways. For example, when choosing the best split, we pick a set of candidates on every slave to obtain better features, and we send the virtual masters information about only a few features: not everything we have, only the top candidates.
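The virtual-master idea can be caricatured in a few lines: each feature is deterministically assigned an aggregating node, so no single node carries the traffic for every feature. This is a toy sketch under my own assumptions, not Matrixnet's actual protocol, and all names are invented:

```python
from collections import defaultdict

def assign_virtual_masters(feature_ids, n_nodes):
    """Deterministically assign each feature an aggregating node (its
    'virtual master'), spreading features across the cluster so neither
    a single master nor all-to-all traffic becomes the bottleneck."""
    return {f: hash(f) % n_nodes for f in feature_ids}

def pick_best_feature(per_worker_stats, n_nodes):
    """per_worker_stats: one dict per worker, mapping feature_id to the
    error reduction that feature's best split achieves on that worker's
    shard of the documents. Returns the winning feature and the node
    that would have aggregated its statistics."""
    masters = assign_virtual_masters(per_worker_stats[0].keys(), n_nodes)
    # each virtual master sums the statistics for its own features ...
    aggregated = defaultdict(float)
    for worker in per_worker_stats:
        for feature, gain in worker.items():
            aggregated[feature] += gain
    # ... and only the per-feature totals travel to the coordinating
    # node, which picks the feature for the next tree level
    best = max(aggregated, key=lambda f: aggregated[f])
    return best, masters[best]

best, node = pick_best_feature([{0: 1.0, 1: 2.0}, {0: 0.5, 1: 3.0}], 4)
# best == 1: feature 1 has the largest total gain across the workers
```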
Matrixnet in ranking
A graph of how the size of the ranking formula has changed over time: the number of iterations is the number of trees in the model, and kilobytes measure the size of the model.
As we can see, both training and model application have to be sped up constantly to keep pace with this growth. So how is machine learning used in the search engine? First, you gather a training set consisting of (query, document) pairs. Assessors rate every such pair: how well the document matches the query. Besides that, each (query, document, judgment) row carries a set of features: query features, document features, and query-document features. If a feature depends only on the query, it is simply duplicated across all of that query's documents.
The model is then trained on the resulting training set. The training modes used in Yandex search:
Regression (point-wise mode): Great = 1, Good = 0.8, Bad = 0 => minimize MSE
Pairwise mode: generate a set of document pairs with different judgments; the formula optimizes correct ordering within each pair.
Optimizing the nDCG ranking function directly (it is not smooth, so a gradient step on it is impossible).
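The pairwise (coupled) mode described above starts from pair generation. Here is a sketch of what those training pairs might look like for a single query, using the point-wise grade scale above; this is an illustration, not Yandex's code, and the document ids are invented:

```python
def make_pairs(docs):
    """docs: list of (doc_id, assessor_grade) for one query.
    Pairwise mode trains on ordered pairs whose grades differ:
    the model should score the better document above the worse one.
    Pairs with equal grades carry no ordering signal and are skipped."""
    pairs = []
    for i, (d1, g1) in enumerate(docs):
        for d2, g2 in docs[i + 1:]:
            if g1 > g2:
                pairs.append((d1, d2))
            elif g2 > g1:
                pairs.append((d2, d1))
    return pairs

docs = [("a", 1.0), ("b", 0.8), ("c", 0.0), ("d", 0.8)]
print(make_pairs(docs))
# → [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('d', 'c')]
```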
Matrixnet tasks
Automatically generating features and smart feature selection
Faster training (CPU, GPU)
Optimizations for sparse data
Training on a cluster with unevenly distributed resources
New regularizations and error functions
Tools for analyzing a trained formula
Predicting training time and the necessary resources
What does a researcher at Yandex have to do to train a formula? He has to handle a number of rather complex and important tasks:
Finding the latest version of the algorithm
Gathering the training data in the necessary format
Finding the necessary computing resources (a cluster)
Launching distributed training
Yandex has a special infrastructure that takes care of all these tasks and makes researchers' lives much more straightforward. It is called Nirvana.
Nirvana principles
Nirvana is a platform for running arbitrary processes. Its key feature is that every process in it is configured as a graph.
Take training a Matrixnet formula as an example: the user builds a graph out of blocks, links them together and launches it. Every block is an operation that receives and produces data, and the blocks are connected through that data. The user launches the training graph, and the entire launch history is saved, so afterwards he can open any previous graph, clone it, launch it again and get a guaranteed identical result.
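Nirvana's actual interface is not public, but the idea of blocks connected through the data they produce can be sketched with a toy topological executor; every name and the two-block example below are invented:

```python
def run_graph(operations, inputs):
    """operations: list of (name, func, input_names) in any order; each
    func maps a dict of named inputs to one output, and outputs are wired
    to later operations by name. A toy executor that repeatedly runs every
    operation whose inputs are ready, loosely in the spirit of the
    'processes as graphs' idea described above."""
    results = dict(inputs)
    pending = list(operations)
    while pending:
        progressed = False
        for op in list(pending):
            name, func, input_names = op
            if all(i in results for i in input_names):
                results[name] = func({i: results[i] for i in input_names})
                pending.remove(op)
                progressed = True
        if not progressed:
            raise ValueError("cycle or missing input in graph")
    return results

# hypothetical two-block graph: featurize a pool, then "train" on it
graph = [
    ("features", lambda d: [x * 2 for x in d["pool"]], ["pool"]),
    ("model", lambda d: sum(d["features"]), ["features"]),
]
out = run_graph(graph, {"pool": [1, 2, 3]})  # out["model"] == 12
```

Because every block's inputs and outputs are explicit data, rerunning the same graph on the same inputs reproduces the same results, which is the reproducibility property the article emphasizes.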
Nirvana pays particular attention to reproducibility: any machine learning experiment at Yandex is reproducible. Besides that, Nirvana lets you view the experiment history of all the other users at Yandex, clone their experiments, modify them and relaunch them. For instance, you can take the training of the production search formula, view its graph, clone it, change some parameters and get your own search formula, which may well turn out better than the existing one.
Nirvana offers plenty of operations: there are about 10 thousand of them at the moment. Various utilities are available, along with a convenient operation search. If the operation you need does not exist, you can create your own. Nirvana supports so-called visual programming, which makes it much easier to build functions and composite operations. And, of course, Nirvana includes the machine learning algorithms most commonly used at Yandex, both Matrixnet and neural networks. Training your own neural network in Nirvana is very simple; there is nothing complex about it, and no additional expertise is required.
Nirvana is a fairly recent system; its alpha version was launched in 2015. But it already has plenty of users: more than 2,000 of them (a third of all of Yandex). Every week about 500 people use Nirvana, launching around 50 thousand graphs weekly.
0 notes
russianseo · 7 years
Text
What awaits the SEO industry after the release of Minusinsk
Tumblr media
After the release of Minusinsk, a new text filter is going to appear, and it will hit both two-dollar copywriting and the websites made for Miralinks, along with everyone who fills web pages with useless content for the sake of expanding their semantics. If we try to analyze the current situation on the SEO market objectively, namely the results of the filters Yandex surprised us with last year (AGS and Minusinsk), it becomes quite apparent that something crucial is still missing from the bigger picture.
Let us avoid any controversy over what "the bigger picture" is by agreeing that everyone has their own notion of it. For optimizers, it is the ability to earn money quickly and easily using an already established system; for Yandex, it is ruining that system and building a new one. Today, despite heavy pressure on optimizers from Yandex (which wiped out two thirds of the link-selling business over the past couple of years), the system has not fundamentally changed. The existing system has turned out to be very robust: it is full of both horizontal and vertical relationships at every level. The SEO market is tied to the relevance of search results and depends on their infrastructure. In many ways, commercial search results are relevant precisely thanks to the efforts of lazy optimizers, and the destruction of the link-selling industry has already led to a temporary yet quite substantial drop in the quality of commercial results.
Yes, the modern market is still far from perfect, but it is changing. The recently released filters are essentially two sides of the same coin: the new AGS punishes websites that sell SEO links, while Minusinsk punishes websites that buy them. The day when unprofessional optimizers lose control over the link factor altogether is quite close, but what comes next?
Well, in that case the coin will turn into a pyramid, and its third side will be the new text filter. A sort of Minusinsk for articles; perhaps it will be named Baku, after the city where the "Iskra" magazine was secretly printed, but the name does not really matter. What matters is that it will hit two-dollar copywriting and all the websites made for Miralinks, as well as everyone who fills websites with useless content for the sake of expanding their semantics. The filter will cut off all content projects that do not offer real value, and its release is all but imminent.
The idea of pessimizing pages for irrelevance seems obvious enough as well, so why does Yandex hesitate to implement it? For the same reason AGS was launched first and Minusinsk only afterwards: to stay on the safe side, Yandex rolls out new filters gradually. Right now Yandex is busy dealing with the consequences of launching Minusinsk, since the changes in the ecosystem caused by the shrinking link market have led to changes in the Yandex algorithm itself.
The overall chaos we are currently seeing in the search results is easily explained by updates to the big data entering Matrixnet: they provoked an unpredictable distortion of the norms inside the complex link-quality metrics. The distortion of those metrics was followed by distortions in the textual analysis of websites, since the relevance of anchors and related keywords was closely tied to incoming and outgoing links. Because of such global changes, Yandex first had to apply manual adjustment to the most important topics, and then it activated an automatic mechanism as well, which caches the ranking metric norms and, a month or two later, compares them with the current state, catches anomalies in the norms and corrects them; in other words, it corrects the algorithm itself. We perceive these processes as a kind of randomization: quite substantial periodic swings in traffic and jumps in positions, which are practically impossible to analyze under current conditions. At the core of such futile analysis are attempts to catch the "multi-armed bandit".
The processes described will continue, and the "amplitude" of these swings will indicate the overall health of the algorithm as it tries to compensate for the growing pressure of Minusinsk. Only when nothing threatens the stability of the search engine any more, and the peak resistance of the surrounding ecosystem has passed, will the time come for the new filter, the article-oriented sibling of Minusinsk. Not before. Minusinsk, whose mission is to eliminate the link-selling industry, is working quite successfully: Sape and similar resources have been damaged severely. With article links it is less straightforward, since Yandex rarely identifies SEO links inside articles. If, for example, you measure link weight, it becomes obvious that links in articles generally carry much more weight than ordinary Sape links, and the share of SEO links in a topical Sape selection checked by the SEO-link filter often reaches 90%, against 10% in articles. So it is only natural that the link-selling industry is quickly moving toward articles.
It is also apparent that the future fight against useless links will not be waged in terms of SEO link versus non-SEO link, since most links in articles are not classified as SEO links (they count as "low-quality advertising", which is conceptually not subject to prohibition, since the existence of advertising links is completely natural), but in terms of SEO article versus non-SEO article. This means the quality of the article itself will come under analysis, both within the website and within its topic.
It is well known that there are two kinds of document quality quorums, and they determine whether a given document will be indexed or kept in the index. If documents of the same quality predominate on the web, it will be difficult for a document to pass the usefulness quorum against all the already existing ones. But if the website hosting the document has greater trust and the document brings it some internal traffic (meaning it passes the usefulness quorum for that website), that is a good reason to keep it in the index.
A certain rotation of documents in and out of the index is quite characteristic of larger websites. The best-known metric of a document's uselessness within a website is semantic similarity, or duplication. Optimizers know well that duplicates can lead to poor ranking, but the upcoming filter will not be limited to this metric. It will pessimize sites for the number of pages caught up in index rotation. This means that if today pages failing the usefulness quorum (for instance, because it is the wrong season for the product described on the page) do not lead to pessimization, in the future even a large number of such pages arising from natural causes will trigger sanctions under the new filter, since a huge product assortment will be considered legitimate mostly for the largest resellers.
This will be true for old commercial websites as well, so I recommend getting rid of unpopular pages or blocking them via robots.txt, letting the search engine know that these pages have only archival value and should not take part in ranking.
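For example, a robots.txt rule blocking a hypothetical archive section might look like this (the paths are placeholders; note that Disallow only stops crawling, it does not by itself guarantee removal from the index):

```
User-agent: Yandex
Disallow: /archive/
Disallow: /discontinued-products/
```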
This mostly concerns ecommerce. As for the relevance of pages on websites made for Miralinks, where the usefulness quorum is not controlled for the vast majority of posted articles: right now such pages are simply dropped from the index, but in the future the new filter will punish for them.
The new filter, much like Minusinsk, will be updated iteratively and will work along two axes: the quantity of useless pages and their quality. Remember what I wrote in my previous article about the thresholds AGS and Minusinsk must cross for the filter to kick in; the filter's thresholds will be raised with every new iteration, destroying the worst articles first.
Preventive measures:
Do not post links in articles on bad websites that nobody reads and nobody ever will
Do not order, purchase or post such articles
Create interesting, living satellite sites that interested readers will actually read
0 notes