Search, big data and log analysis: a coming of age story
by Gal Berg, CTO of XpoLog
The evolution of search is a fascinating story that can tell us a lot about how to solve other big data challenges such as log management and log analysis in IT environments. After all, in all these cases the basic idea is finding the “needle in a haystack” as quickly as possible.
Organizing data into directories
In the early days of the Web, entrepreneurs identified the need to organize Web sites so that users could discover them easily. Companies such as Yahoo emerged with directories, which required users to browse through categories of interest, much like the yellow pages. The directories grew, and search engines were eventually added to make finding Web sites easier. The directory method was largely based on human content curation. The human element kept Web sites organized in the right semantic context, but that approach could not keep up as the Web grew larger. New technology became necessary to manage the complexity.
The introduction of first-generation search engines
The first generation of search engines, which included AltaVista and Lycos, used Web crawlers and indexes to capture the Web and offered the ability to search based on keywords. Naturally, these technologies focused on matching keywords against the index without really understanding the semantic meaning of the text. Algorithms were not yet “smart” enough to judge the importance or relevance of Web pages to the user’s query; results were based solely on matches between keywords and the indexed pages. Without additional techniques to add context and meaning to the search query, users had to wade through a multitude of irrelevant results. A search for the movie “home alone”, for example, would return many meaningless results: Web pages that merely contained the words “home” and “alone”.
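The keyword-matching model described above can be sketched in a few lines. This is a minimal, illustrative inverted index with invented page contents (no ranking, no semantics), which is exactly why the “home alone” query matches pages that have nothing to do with the movie:

```python
# Minimal sketch of first-generation keyword search: an inverted index
# that matches terms with no notion of relevance or semantics.
# Page ids and contents are invented for illustration.
from collections import defaultdict

pages = {
    "movie-site": "home alone is a classic holiday movie",
    "real-estate": "buy a home in the suburbs today",
    "blog-post": "i love living alone far from home",
}

# Build the inverted index: term -> set of page ids containing it.
index = defaultdict(set)
for page_id, text in pages.items():
    for term in text.split():
        index[term].add(page_id)

def keyword_search(query):
    """Return every page containing ALL query terms, in no particular order."""
    results = set(pages)
    for term in query.lower().split():
        results &= index.get(term, set())
    return sorted(results)

print(keyword_search("home alone"))  # ['blog-post', 'movie-site']
```

Note that the irrelevant blog post matches just as well as the movie page: the index knows which documents contain which words, but nothing about what the query means.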
Semantic analysis, ranked results and additional context
The next big revolution came with Google and its famous page-ranking algorithm. Google, as opposed to other players at the time, used additional factors to rank search results, treating every user as an individual. Google profiles each user (geographic location, search history, etc.), and its search engine performs a semantic analysis of both the Web pages and the search query, along with an authority ranking influenced by the number of links pointing to each page. These additional signals add context to the search results and help users find the most relevant pages in a fast-growing Web. In essence, Google reintroduced the human aspect (location, history, links) to the mix in order to add meaning to its results. In our previous example, Google would use semantic analysis to recognize that “home alone” is the title of a movie, provide screening information relevant to the user’s location (local theaters) and return additional information from authoritative sites such as IMDB. Furthermore, the addition of auto-complete to the search query helped users discover more content and search more effectively.
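The authority idea can be shown in a drastically simplified form. This sketch is not Google’s actual PageRank algorithm; it just re-orders the same keyword matches by an invented inbound-link count, so an authoritative page surfaces first instead of being buried among equal matches:

```python
# Toy sketch of authority-style ranking: the same keyword matches are
# re-ordered by how many other pages link to each result.
# Page names and link counts are invented for illustration.
links_to = {
    "imdb-home-alone": 950,  # many inbound links -> high authority
    "random-blog": 12,
    "spam-page": 1,
}

# Suppose a plain keyword search returned these matches in arbitrary order.
matches = ["spam-page", "imdb-home-alone", "random-blog"]

# Rank by authority signal instead of returning matches as-is.
ranked = sorted(matches, key=lambda page: links_to.get(page, 0), reverse=True)
print(ranked[0])  # imdb-home-alone
```

The key shift is that matching and ranking become separate steps: the index decides *what* matches, while external signals decide *in what order* the user sees it.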
The evolution of log management and analysis
How does this relate to software development? IT organizations have always understood the importance of using the log files generated by infrastructure elements and applications to identify security breaches and troubleshoot faults. However, analyzing these log files wasn’t feasible due to the huge amounts of data, the inconsistent structure of logs, and the lack of standardization – a classic big data challenge. In recent years, several technological solutions emerged to solve this problem, including Splunk, XpoLog, ArcSight, LogLogic and more. These were the equivalent of the first-generation Web search engines.
The first generation log management/analysis technologies
As was the case with Web search, these log management technologies initially focused on the input side, which in itself was extremely challenging: millions of events had to be normalized into some kind of structure that would add meaning and consistency for aggregation. At the other end was a search engine that enabled users to find specific faults and symptoms in order to perform root cause analysis. These tools turned their users into detectives, and the success of their troubleshooting efforts relied on their investigative skills. They had to know what to search for, and then decipher the clues within thousands of results, deepening the investigation until they found a resolution. The reliance on the user’s skills, combined with the fact that large amounts of data were returned without any relevancy mechanism, made for inconsistent and time-intensive processes.
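The normalization step mentioned above can be sketched as pattern matching against known line formats. This is a simplified illustration, not any vendor’s actual parser; the two log formats and field names are invented, but the idea is the same: map heterogeneous raw lines onto one consistent event schema so they can be aggregated and searched:

```python
# Sketch of log normalization: parsing heterogeneous raw log lines
# into one consistent event structure. The line formats are invented.
import re

PATTERNS = [
    # e.g. "2024-05-01 12:00:03 ERROR db: connection refused"
    re.compile(r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
               r"(?P<level>\w+) (?P<source>\w+): (?P<msg>.*)"),
    # e.g. "May  1 12:00:04 appserver [WARN] slow response"
    re.compile(r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
               r"(?P<source>\w+) \[(?P<level>\w+)\] (?P<msg>.*)"),
]

def normalize(line):
    """Map a raw log line onto a common event schema, or None if unparsable."""
    for pattern in PATTERNS:
        m = pattern.match(line)
        if m:
            return {
                "timestamp": m.group("ts"),
                "severity": m.group("level").upper(),
                "source": m.group("source"),
                "message": m.group("msg"),
            }
    return None

event = normalize("2024-05-01 12:00:03 ERROR db: connection refused")
print(event["severity"], event["message"])  # ERROR connection refused
```

Once every event carries the same fields regardless of its original format, the search engine on top can treat millions of lines from different systems as one queryable dataset.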
Preconfigured search rules for security applications
Then came the Google-like approaches. First-generation log management solutions such as ArcSight, Network Intelligence, and Splunk originally focused on security applications, using logs from infrastructure elements to identify and troubleshoot attacks and system vulnerabilities. These solutions shipped with preconfigured search rules that offloaded the “knowledge” of what to search for from the user to the tool, helping users with their investigation efforts. This use of pre-existing knowledge was possible because the monitored infrastructure layer systems came from a handful of large-scale vendors (Cisco, Dell, HP, IBM, etc.), which meant that the logs and error messages were somewhat consistent and relatively easy to analyze. However, this is not the case in today’s complex virtual IT world with its application-based logs.
The application logs challenge
The system layer is relatively monolithic, but the application layer springs from a multitude of developers: in-house teams, commercial vendors and others. There are no industry standards for documenting and managing log events, so these applications are extremely inconsistent both in terms of the IDs used for events and their wording. Troubleshooting problems in such a chaotic environment is significantly more challenging than in the relatively “organized” infrastructure layer. Additional intelligence was required.
Augmented search – improving search with intelligence amplification (IA)
Intelligence amplification (not to be confused with Artificial Intelligence) refers to the use of information technology to augment human intelligence. It complements human intelligence by extending the information processing capabilities of individual developers. The newest generation of log analysis and management solutions employs this capability, because the industry came to the realization that users don’t necessarily know what to search for, especially in the chaotic application layer environment. These solutions combine semantic processing, statistical models, and machine learning to analyze and “understand” the events in the logs, and then display results with intelligence layers that add meaning and context, much like Google does.
The analysis algorithm automatically identifies errors, risk factors and problem indicators, and analyzes their severity. These insights are displayed as intelligence layers on top of the search results, helping the user quickly discover the most relevant and important information.
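The idea of an intelligence layer on top of raw results can be sketched as follows. This is an illustrative toy, not XpoLog’s actual engine: the severity rules, problem groups and event messages are all invented. Each matched event gets a severity marker and a problem group, and a summary layer counts the groups so the user sees “11 Connection Problem events” instead of a wall of raw lines:

```python
# Illustrative sketch of an "intelligence layer": tagging matched events
# with a severity and problem group, then aggregating a summary layer
# on top of the raw result list. Rules and events are invented examples.
from collections import Counter

# (indicator substring, severity, problem group) -- invented rules.
SEVERITY_RULES = [
    ("connection refused", "HIGH", "Connection Problem"),
    ("timeout", "MEDIUM", "Slow Response"),
    ("disk full", "HIGH", "Storage Problem"),
]

def augment(events):
    """Tag each event, then return the tagged events plus a summary layer."""
    tagged = []
    layers = Counter()
    for msg in events:
        severity, group = "LOW", "Unclassified"
        for indicator, sev, grp in SEVERITY_RULES:
            if indicator in msg.lower():
                severity, group = sev, grp
                break
        tagged.append({"message": msg, "severity": severity, "group": group})
        layers[(group, severity)] += 1
    return tagged, layers

events = [
    "ERROR db1: connection refused",
    "ERROR db1: connection refused",
    "WARN app: request timeout on /login",
]
tagged, layers = augment(events)
print(layers[("Connection Problem", "HIGH")])  # 2
```

Instead of reading every event, the user starts from the summary layer and drills down into the group that matters, which is the workflow the banking scenario below walks through.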
Screenshot: for the simple search term “error”, XpoLog presents the result set and adds semantic markers on the events, plus an augmentation layer on the time histogram showing other problems and events related to the search results.
Augmented search in action
This Web application scenario illustrates the power of augmented search. Users are complaining that the data presented when they access their bank accounts is incorrect or missing. The production support team starts troubleshooting by analyzing the transaction flows and response times to diagnose the root cause. In this case, they find that everything is working properly within the threshold limits across the data center, bringing them to the conclusion that the problem might be an internal application problem.
To check the application itself, they must investigate its log files. They start with a broad search term, such as “error”, since they don’t know exactly what to search for. Unfortunately, this search query returns 437,854 events. With simple keyword searches, it would be almost impossible to identify the root cause in a timely manner.
But through the augmented search layer, the user can immediately see that the application logged 11 high-priority (red) “Connection Problem” events. Clicking the “Connection Problem” link returns only 531 events. The difference is dramatic.
The analytics engine automatically marked the events by severity, so the user could immediately see that there was a connection problem with one of the application’s databases. That was what caused the missing or false data for many users. The connection issue was resolved simply by restarting the database, solving the problem within minutes instead of hours.
From Web searches to log analysis, big data challenges require solutions that amplify human intelligence with meaningful insight. Intelligence amplification solutions bring the most relevant information forward, dramatically shortening the time to resolution by understanding the meaning and context of the information.