Elasticsearch — A guide to the concepts and their implementation
Having used Elasticsearch (ES) at one point in my journey so far, I’d like to discuss the key points from a learning and real-world implementation viewpoint.
ES is a search engine built on top of Lucene that uses an inverted index data structure for searching. It is a highly available, distributed, replicated, document-oriented, near-real-time search server.
ES is near real time because writes first land in an in-memory buffer. During a refresh, the buffered documents are written to an in-memory segment, and only then do they become searchable. By default a refresh happens once every second, which is what makes ES "near" real-time rather than real-time. You can configure the refresh interval to be shorter than a second, but this is not recommended: each refresh creates a new segment, and refreshing too often produces many tiny segments and adds overhead.
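As a minimal sketch of tuning this behaviour, assuming the official Python client (elasticsearch-py 8.x), a cluster at localhost:9200, and a hypothetical index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Relax the refresh interval to 5s: fewer, larger segments, at the cost of
# newly indexed documents taking up to 5s to become searchable.
es.indices.put_settings(
    index="my-index",
    settings={"index": {"refresh_interval": "5s"}},
)

# "1s" restores the default; "-1" disables refresh entirely (useful during
# bulk loads, but documents stay unsearchable until it is re-enabled).
```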
ES Indexes: In ES, an index is simply a grouping of documents, each of which is made up of a number of fields. Each index is split into shards. When you create an index in versions before 7.0, you get 5 primary shards and 1 replica by default; from 7.0 onwards the default is 1 primary and 1 replica. Each shard is a Lucene index, a Lucene index is a collection of Lucene segments, and each Lucene segment is an inverted index.
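A minimal sketch of creating an index with explicit shard counts, again assuming elasticsearch-py 8.x; the index name and numbers are illustrative, not recommendations:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The primary shard count is fixed at creation time, so choose it up front.
es.indices.create(
    index="products",
    settings={
        "number_of_shards": 3,    # primary shards (cannot be changed later)
        "number_of_replicas": 1,  # replicas per primary (can be changed later)
    },
)
```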
ES Scoring: A document’s score indicates how relevant it is to the search query. A score is a positive number. The classic formula is Term Frequency/Inverse Document Frequency (TF/IDF); since version 5.0, ES defaults to BM25, a refinement of the same idea. Term frequency is the number of times the search term appears in the document: higher frequency means greater relevance. Inverse document frequency discounts terms that appear in many documents across the index: the more common a term is, the less it contributes to the score.
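Schematically (the exact Lucene formula adds normalization factors), the classic TF/IDF weight of a term t in a document d can be written as:

$$
w(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}
$$

Here N is the total number of documents and df(t) is the number of documents containing t, so rare terms weigh more than common ones.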
ES has many data types; highlighting the commonly used ones only (a sample mapping follows the list):
- boolean
- date
- numeric — byte, short, integer, long, float, double
- keyword
- text
- binary
- spatial — geo_point(lat and long), geo_shape(polygons)
- ip
- histogram
- object
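A minimal mapping sketch that exercises several of the types above, assuming elasticsearch-py 8.x; the index and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="stores",
    mappings={
        "properties": {
            "name":      {"type": "text"},      # analyzed, full-text search
            "sku":       {"type": "keyword"},   # exact match, aggregations
            "in_stock":  {"type": "boolean"},
            "price":     {"type": "double"},
            "added_on":  {"type": "date"},
            "location":  {"type": "geo_point"}, # lat/lon pair
            "client_ip": {"type": "ip"},
        }
    },
)
```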
ES Refresh and Flush — In addition to the in-memory buffer, every write is appended to the translog for durability. A refresh moves the buffered documents into a new in-memory segment; this happens once every second by default, creating a new segment each time. An ES flush is a Lucene commit: the accumulated segments are written to disk and the translog is cleared. A flush does not happen on every refresh; it is triggered when the translog reaches a threshold size (512 MB by default), and the trigger conditions can be tuned. Combining many small segments into larger ones is handled by a separate background merge process.
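A minimal sketch of tuning the translog, assuming elasticsearch-py 8.x and a hypothetical index; only do this if the durability trade-off is acceptable for your data:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_settings(
    index="my-index",
    settings={
        "index": {
            # fsync the translog periodically instead of on every request:
            # faster writes, but a crash can lose the last few seconds.
            "translog.durability": "async",
            # flush (Lucene commit) once the translog reaches this size.
            "translog.flush_threshold_size": "512mb",
        }
    },
)
```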
ES Nodes: There are various node types in ES, but the commonly used ones are:
- Master Node: The master node is responsible for assigning shards to nodes, creating indexes, and monitoring the cluster’s health, which includes tracking all active and inactive nodes. In production, we would prefer to have a dedicated master node.
- Data Node: the node that holds the data.
- Coordinating Node: Any node in the cluster can act as a coordinating node. The coordinating node is the one that receives the client request; it then sends parallel requests to the relevant data nodes, gathers their responses, and transmits the merged result back to the client.
Typically, a dedicated master node and a minimum of three data nodes are used in production environments. Each index is made up of several shards, which are dispersed across the data nodes; each shard can hold roughly 2 billion documents (a Lucene limit). Spreading data across more servers allows more parallel processing, but at the cost of additional hardware, so the two must be weighed against each other.
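A minimal sketch for inspecting the cluster state that the master node tracks, assuming elasticsearch-py 8.x:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Overall cluster health: green / yellow / red, node and shard counts.
health = es.cluster.health()
print(health["status"], health["number_of_nodes"], health["active_shards"])

# Per-shard allocation: which shard (primary "p" or replica "r") lives where.
for shard in es.cat.shards(format="json"):
    print(shard["index"], shard["shard"], shard["prirep"], shard["node"])
```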
Analyzers and ES Analysis: Analysis is the process of breaking text down into tokens that can be added to a searchable inverted index. Analysis is carried out by analyzers, which are simply a pipeline of the following (demonstrated in the sketch after this list):
- Character Filter: adds, removes, or replaces characters in the incoming stream. For instance, it can strip HTML elements or convert between different character formats. An analyzer can have zero or more character filters.
- Tokenizer: splits the text into individual tokens. For instance, a whitespace tokenizer generates a new token at every white space. An analyzer has exactly one tokenizer.
- Token Filter: adds, removes, or modifies tokens in the stream. For example, a lowercase token filter converts tokens to lower case, and a stop token filter removes stop words (the, at, are, no, not, this, or, then, etc.) from the token stream. A synonym token filter defines and handles synonyms, including custom-defined ones. An analyzer can have zero or more token filters.
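A minimal sketch of these pieces in action, assuming elasticsearch-py 8.x; the _analyze API lets you run an ad-hoc tokenizer-plus-filters pipeline and inspect the tokens:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.indices.analyze(
    tokenizer="whitespace",
    filter=["lowercase", "stop"],
    text="The Quick Brown Fox",
)
print([t["token"] for t in resp["tokens"]])
# Expected: ['quick', 'brown', 'fox'] — lowercased, stop word "the" removed.
```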
Analyzers apply only to the text data type, not to keywords: keyword fields are matched exactly, so no analysis is performed on them.
Predefined Analyzers (Calling out the important ones):
- Standard Analyzer (the default): it has a standard (grammar-based) tokenizer, a lowercase token filter, and a stop token filter (disabled by default)
- Simple Analyzer
- Whitespace Analyzer
- Stop Analyzer
- Language Analyzer
- Pattern Analyzer
Bespoke (custom) analyzers can be created and then mapped to the fields of a document at the time the index is created. Analyzers are applied both during indexing and during searching.
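A minimal sketch of such a custom analyzer, wired from the three building blocks above and mapped to a text field at index creation; it assumes elasticsearch-py 8.x, and the names ("my_analyzer", "articles") are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="articles",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],    # drop HTML markup
                    "tokenizer": "standard",          # exactly one tokenizer
                    "filter": ["lowercase", "stop"],  # token filters, in order
                }
            }
        }
    },
    mappings={
        "properties": {
            "body": {"type": "text", "analyzer": "my_analyzer"}
        }
    },
)
```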
Query context and filter context
The query context asks how well the document matches the search phrase; here the score is relevant. The filter context asks only whether a document matches the query at all: it is a simple boolean response, no score is computed, and the result can be cached. For instance, whether or not a person is a member of a group can only be answered with yes or no, so it belongs in a filter.
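A minimal sketch contrasting the two contexts inside one bool query, assuming elasticsearch-py 8.x; the index and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="employees",
    query={
        "bool": {
            "must": [
                # query context: analyzed match, contributes to _score
                {"match": {"bio": "distributed systems"}}
            ],
            "filter": [
                # filter context: pure yes/no membership, no scoring, cacheable
                {"term": {"group": "platform"}}
            ],
        }
    },
)
```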
Query DSL (Domain-Specific Language): the JSON-based technique for formulating Elasticsearch queries. It has these features (query sketches follow the list):
- Leaf Queries: include term, match, and range queries. A range query supports conditions such as greater than (gt), greater than or equal to (gte), less than (lt), and less than or equal to (lte), plus a format parameter for dates. A term query can be used if you require an exact match against a field. A match query returns documents that match a given text; the given text is analyzed before matching.
- Compound Queries: wrap other leaf queries or compound queries, joining them together with constructs such as bool or dis_max.
- Expensive Queries: these queries slow down the system since, by the nature of their implementation, they take longer to execute; they are allowed by default, controlled by the search.allow_expensive_queries setting. Script queries, fuzzy queries, range queries (on text and keyword fields), prefix queries, wildcard queries, and geo_shape queries belong to this type.
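A minimal sketch of the three leaf queries named above, assuming elasticsearch-py 8.x; the index and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# term: exact match, no analysis — suited to keyword fields.
es.search(index="products", query={"term": {"sku": "SKU-1234"}})

# match: the given text is analyzed before matching.
es.search(index="products", query={"match": {"description": "wireless mouse"}})

# range: gt / gte / lt / lte bounds (a format parameter applies to dates).
es.search(index="products", query={"range": {"price": {"gte": 10, "lt": 100}}})
```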
Depending on whether they are used in a query context or a filter context, query clauses act differently.
Aggregations: ES is not only a search engine but also an analytics and statistics engine, and you achieve this through aggregations. ES supports the following families of aggregations (a worked example follows the list):
- Metrics Aggregations: minimum, maximum, count, average, sum, percentiles, weighted average, top hits, and top metrics
- Bucket Aggregations: histogram, IP range, geo-distance, range, filter, terms, parent, children, and sampler
- Pipeline Aggregations: min bucket, max bucket, avg bucket, sum bucket, stats bucket, and percentiles bucket
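A minimal sketch combining a terms bucket aggregation with a nested avg metric aggregation, assuming elasticsearch-py 8.x; names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="products",
    size=0,  # aggregation results only, no hits
    aggs={
        "by_category": {
            "terms": {"field": "category"},  # one bucket per category value
            "aggs": {
                "avg_price": {"avg": {"field": "price"}}  # metric per bucket
            },
        }
    },
)
for bucket in resp["aggregations"]["by_category"]["buckets"]:
    print(bucket["key"], bucket["avg_price"]["value"])
```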
Use Cases of ES: ES is a very good fit for the following use cases:
- Textual Search — Document Search
- Email Body
- App Logs
- Product Search
- Auto Suggest
- Auto Complete (see the sketch below)
As long as the intention is to search many times and write comparatively rarely, ES can be appreciated and leveraged to the maximum.
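For the auto-complete use case above, a minimal sketch using a match_phrase_prefix query, assuming elasticsearch-py 8.x and hypothetical names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="products",
    query={
        "match_phrase_prefix": {
            # matches "wireless mouse", "wireless modem", ...
            "name": "wireless mo"
        }
    },
)
```

For heavier auto-complete traffic, a dedicated search_as_you_type field or the completion suggester is usually a better fit, since both do more of the work at index time.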
Factors to consider in using ES: The following are the things to consider when you want to use ES:
- Use ES only when the search-to-write ratio is around 70:30, i.e. 70% searches and 30% writes. Do not use ES as a database.
- The number of ES servers and shards has to be determined up front, which requires knowing the expected number of search hits. Though ES follows a peer-to-peer architecture pattern, there is a primary/replica concept: primary and replica shards. Since each shard can hold at most about 2 billion documents, the number of shards must be known and defined at index-creation time.
- How many indexes to have? This is purely driven by the business use cases.
- Are we going to use expensive queries? Expensive queries take longer due to their inherent implementation and slow down the system. Examples are script queries, range queries, wildcard queries, fuzzy queries, and prefix queries.
- Do we really need custom analyzers? Analyzers only work on text fields. Do not complicate things in the name of having custom analyzers; software is meant to be kept simple.
- Ensure that ES instances process only ES requests, so that each node is a dedicated ES node.
Thanks