Google, Bing & Co. - How do search engines work?

3/30/2022

What is a search engine?

A search engine is a program that searches for content, typically used through a web browser. This content is stored on the Internet or in a closed database. After a search query has been entered as text or, as is increasingly common, by voice, the search engine delivers a list of results that point to the relevant documents.

General process

Search engines do not search the entire Internet, but only a part of it: the World Wide Web. The Internet also includes services such as e-mail, chat, Internet telephony and data transmission. The real systems are of course much more complex than described in this article, so you can regard it as a short introduction to the world of search engines.

In general, the process can be divided into three steps. The first step is collection: the search engine constantly gathers new information with the help of so-called crawlers, which are discussed in more detail later on. In the second step, the collected information is processed in such a way that an index can be created from it. The index is the core of web search. The third step is providing information: for each request, the appropriate index is searched.

What is an index?

The index is the search engine operator's database. It gives the collected data the structure a database needs, which makes searching and sorting by specific fields easier and faster. The index covers all pages that the search engine has processed so far. In this case, processed means that the search engine has found and analyzed the page and permanently saved its relevant content. You can think of the index as a very large database that holds all content collected so far and is continuously extended with content that is yet to be collected. The index therefore plays an extremely important role.
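To make the idea of an index more tangible, here is a minimal sketch in Python. It assumes the index is organized as a mapping from words to the addresses of the pages that contain them (a structure commonly called an inverted index). The pages, addresses and function names are invented for illustration and say nothing about how a particular search engine stores its data.

```python
from collections import defaultdict

# Hypothetical mini "web": address -> page text (made up for illustration).
PAGES = {
    "example.com/wifi": "wifi for my business",
    "example.com/router": "business router setup",
    "example.com/cafe": "free wifi in my cafe",
}


def build_index(pages):
    """Map every word to the set of addresses whose text contains it."""
    index = defaultdict(set)
    for address, text in pages.items():
        for word in text.lower().split():
            index[word].add(address)
    return index


def search(index, query):
    """Return the addresses that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(PAGES)
print(sorted(search(index, "wifi business")))  # ['example.com/wifi']
```

Because the words are looked up directly, a query never has to read through the page texts themselves, which is what makes searching the index fast.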

From search query to result

In our case, we type “WiFi for my business” in the search bar. This step is visible to us as users. But what happens when we press the ENTER key?

First, the right data center must be found. Search engine operators own data centers all over the world, with servers on which the index is stored. In the background, one or more data centers that suit the search query are selected immediately. Factors such as spatial proximity, speed and utilization play an extremely important role here.

The subsequent search through the index is another invisible step. With major search engines such as Google, Yahoo and Bing, the index is built specifically for parallel queries. This means that each server only needs to perform a portion of the query, so the index can be searched faster. It is also important to mention that a search engine does not search for the specific word, but for letter patterns.

After the search has been carried out and a first result has been compiled, another check takes place. The user may have made a typo, or may have written as separate words a term that belongs together. You probably know this yourself: in a hurry, it is easy to mistype, and Google then offers you the “Did you mean” function. The search engine recognizes such cases and suggests a more suitable term if required.

Once the index has been searched, the matching documents must be found. The index servers primarily store words and addresses, which point to the so-called doc servers. The doc servers contain titles, text excerpts and other data of the saved documents. To speed up the process a bit, results of frequent search queries can be cached. This means that the index does not have to be searched again each time; the cached data is used instead. In the last step, the search results must be displayed to the user.
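The lookup described above (index servers that each answer part of the query in parallel, doc servers that hold the titles, and a cache for frequent queries) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a description of any real search engine: the shard contents, document IDs and titles are invented, threads stand in for separate servers, and Python's `functools.lru_cache` stands in for a result cache.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical index shards: each "index server" holds word -> document IDs
# for its own slice of the documents.
INDEX_SHARDS = [
    {"wifi": {1}, "business": {1, 2}},
    {"wifi": {3}, "router": {4}},
]

# Hypothetical "doc server": document ID -> title of the saved document.
DOC_SERVER = {
    1: "WiFi for small businesses",
    2: "Business insurance basics",
    3: "Free WiFi in cafes",
    4: "Router setup guide",
}


def search_shard(shard, words):
    """Return the IDs in one shard whose documents contain every query word."""
    matches = [shard.get(word, set()) for word in words]
    return set.intersection(*matches) if matches else set()


@lru_cache(maxsize=1024)  # frequent queries are answered from the cache
def search(query):
    """Query all shards in parallel, then fetch titles from the doc server."""
    words = tuple(query.lower().split())
    with ThreadPoolExecutor(max_workers=len(INDEX_SHARDS)) as pool:
        partial_results = pool.map(
            lambda shard: search_shard(shard, words), INDEX_SHARDS
        )
        doc_ids = set().union(*partial_results)
    return tuple(DOC_SERVER[doc_id] for doc_id in sorted(doc_ids))


print(search("wifi"))           # ('WiFi for small businesses', 'Free WiFi in cafes')
print(search("wifi business"))  # ('WiFi for small businesses',)
print(search("wifi"))           # same result, this time served from the cache
```

Each shard only sees its own slice of the documents, so no single server has to scan the whole index, and a repeated query skips the index search entirely.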
Because users usually click on one of the first ten search results, the most important websites for the entered search term should be visible at first glance. After all these steps, as users, we now see a long list of results for our search query “WiFi for my business”. As you can see, many complex but interesting steps take place in just 0.57 seconds.

The search engine is not a person. So that it can understand the documents and prepare their information appropriately, the following steps are necessary:

1. Normalization

Data normalization is about storing all necessary and relevant information. All unnecessary information about programming and formatting is removed here. As a result, the document can be better understood and analyzed by the search engine in the next steps.

2. Tokenizing

For the search engine, the incoming document is at first just a sequence of characters. In order for it to recognize keywords, it must be able to pick out individual words. For this purpose, word separators such as spaces or special characters such as # or + are used. These help the search engine identify word boundaries.

3. Lowercase conversion

Individual words are automatically converted to lowercase, as this makes further processing, analysis and comparisons significantly easier.

4. Language detection

The next step is language detection. If you search for something in German, you can of course also expect German-language results. To recognize the correct language, the search engine uses various systems. The language is identified by comparing the text with dictionaries and other documents.

5. Basic form reduction through word stemming

Word stemming brings every word back to its basic form. In this way, related words can be brought together: a basic word form is saved and all other forms are grouped under it. This proves to be particularly efficient for the search engine, as the volume of data is reduced.

6. Stop word analysis

In this step, the stop word analysis takes place. Stop words are words like with, you, but or we. These words are not relevant for understanding the meaning of the text. By comparing the text with a list of stop words, the search engine removes them.

7. Keyword extraction and analysis

Now that the search engine has identified the terms relevant to it using a type of filter, the following aspects are checked: on the one hand the spelling, and on the other hand synonyms and homonyms. Keyword identification is also important for the right result. For a term to count as a keyword, the following three criteria must be met:

  • Visual word validity: The terms must serve as relevant keywords. Accordingly, conjunctions and negations must be excluded.
  • Weighting validity: The question here is to what extent the key term is important for the content of the text.
  • Cluster validity: The keywords should be chosen so that they can establish links to other documents.

8. Word ID

In the last step, the terms are given a unique wordID. For example, “WiFi for my business” simply becomes #6678, which saves the system storage space because the wordID is shorter than the text it replaces.
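Several of the preparation steps can be strung together in a short Python sketch. It covers normalization, tokenizing, lowercase conversion, stemming, stop word removal and wordID assignment; language detection and keyword analysis are left out. Everything here is a simplifying assumption for illustration: the stop word list is tiny, the tokenizer only handles ASCII letters and digits, the "stemmer" merely strips a few English suffixes, and IDs are handed out per word as simple counters.

```python
import re
from html import unescape

# Assumed, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"with", "you", "but", "we", "for", "my"}


def normalize(html):
    """Step 1: strip markup so that only the readable text remains."""
    text = re.sub(r"<[^>]+>", " ", html)
    return unescape(text)


def tokenize(text):
    """Steps 2 and 3: split at separators and convert to lowercase."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Step 5: naive suffix stripping, a crude stand-in for a real stemmer."""
    if word.endswith("ss"):
        return word
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def remove_stop_words(tokens):
    """Step 6: drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


class WordIds:
    """Step 8: hand out one integer ID per distinct term."""

    def __init__(self):
        self._ids = {}

    def get(self, term):
        return self._ids.setdefault(term, len(self._ids) + 1)


def process(html, word_ids):
    """Run the steps in order and return the wordIDs of the remaining terms."""
    stems = [stem(token) for token in tokenize(normalize(html))]
    return [word_ids.get(term) for term in remove_stop_words(stems)]


ids = WordIds()
sample = "<p>WiFi for my <b>business</b>: networking with businesses &amp; networks</p>"
print(process(sample, ids))  # [1, 2, 3, 2, 3]
```

In the sample, "business" and "businesses" end up with the same ID, as do "networking" and "networks", which is the space-saving effect that stemming and wordIDs are meant to achieve.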