For me, the term "artificial intelligence" is strongly bound to Arnold Schwarzenegger trying first to ruin and later save the world as The Terminator. More than 30 years have passed since the film, and AI is a buzzword once again. But how do we separate fact from fiction?
Wikipedia describes AI as "intelligence exhibited by machines" -- not the most informative of definitions. Most people today divide the AI field into subfields like reasoning, knowledge, planning, learning, natural language processing (communication), perception and the ability to move and manipulate objects -- in other words, everything one would expect from the Terminator.
However, AI has much more down-to-earth applications, such as the visual object recognition used in Google Photos auto-tagging, which allows you to search your photos for sunset pictures, or the speech recognition and natural language processing used in Siri and Alexa personal assistants to understand and provide useful answers to your questions.
What's machine learning?
As opposed to the fascinating science-fiction world of AI, machine learning is a much geekier domain. Machine learning is the process of getting computers to act without being explicitly programmed. In a conventional program, a human writes the logic and supplies the parameters; the program processes input with those parameters and generates output. In machine learning, the human builds a parametrized model and writes a program that implements it, and the program itself finds the "right" parameter values that yield a proper solution and integrates them into the model.
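The contrast can be sketched in a few lines of code. This is a minimal illustration with made-up data: a conventional function whose parameter the human supplies, next to a one-parameter model y = x * factor whose parameter the program fits from example input/output pairs by least squares.

```python
def convert(x, factor):
    """Conventional program: the human supplies the parameter."""
    return x * factor

# Machine learning flavor: the model is y = x * factor, and the program
# finds the "right" factor from observed (input, output) samples.
samples = [(1.0, 1.61), (2.0, 3.22), (5.0, 8.05)]  # hypothetical miles -> km observations

def fit_factor(samples):
    # Least-squares estimate for the single parameter of y = x * factor
    num = sum(x * y for x, y in samples)
    den = sum(x * x for x, _ in samples)
    return num / den

factor = fit_factor(samples)
print(round(factor, 2))              # the learned parameter, roughly 1.61
print(round(convert(10, factor), 1))  # the model applied to new input
```

The same program would learn a different factor from different samples; nothing about miles or kilometers is hard-coded.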
Machine learning reach
While the definition of machine learning focuses on the technical details of how machine learning solutions work, the more interesting perspective is the variety of problems machine learning approaches can solve. With ever-increasing computational power extending the reach of these computationally heavy algorithms from supercomputers to PCs, laptops and even low-end smartphones, machine learning approaches can be effective in almost every domain of computational problems.
Machine learning algorithms
Machine learning algorithms are divided into several classes, including supervised, unsupervised and reinforcement learning.
Supervised learning includes significant human guidance. For example, in labeling problems (also known as classification problems), the algorithm trains the model with a labeled dataset -- a collection of data samples, each associated with a label or class. The outcome of this training process is a classification model that can predict the classes of new data samples.
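A toy nearest-centroid classifier shows the shape of this process: train on labeled samples, then classify new ones. The features and labels here are invented purely for illustration.

```python
def train(labeled):
    """labeled: list of (feature, label) pairs. Returns one centroid per label."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(model, x):
    """Predict by picking the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

# Hypothetical labeled dataset: a single numeric feature per sample.
training_set = [(1.0, "benign"), (2.0, "benign"), (9.0, "malicious"), (11.0, "malicious")]
model = train(training_set)
print(classify(model, 1.5))   # -> benign
print(classify(model, 10.0))  # -> malicious
```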
In unsupervised learning, the human does not label anything, and the range of problems the algorithm can solve narrows significantly: dividing data into groups (clustering), figuring out relations between features of the data samples (dimensionality reduction), estimating how close data samples are to each other (density estimation) and finding data samples that seem odd and don't fit any pattern (outlier detection).
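One of those unsupervised tasks, outlier detection, can be sketched with a simple z-score test: flag samples that sit far from the bulk of the unlabeled data. The data and threshold below are illustrative.

```python
import statistics

def find_outliers(samples, z_threshold=2.0):
    """Flag samples more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) / stdev > z_threshold]

# Hypothetical requests-per-minute counts; one sample doesn't fit the pattern.
requests_per_minute = [5, 6, 5, 7, 6, 5, 6, 90]
print(find_outliers(requests_per_minute))  # -> [90]
```

No one told the algorithm which samples are "bad"; the odd one falls out of the data itself.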
In reinforcement learning, the algorithm receives feedback -- from a human user or from the environment -- on the quality of the decisions it makes, and adjusts its behavior to favor the decisions that earn the best feedback.
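A minimal reinforcement-learning sketch is the epsilon-greedy bandit: the agent learns, purely from reward feedback, which of two actions pays off more often. The reward probabilities are invented and hidden from the agent.

```python
import random

random.seed(0)
reward_prob = {"a": 0.2, "b": 0.8}  # the environment; unknown to the agent
value = {"a": 0.0, "b": 0.0}        # the agent's estimated value per action
counts = {"a": 0, "b": 0}

for step in range(1000):
    if random.random() < 0.1:            # explore occasionally
        action = random.choice(["a", "b"])
    else:                                # otherwise exploit the best estimate
        action = max(value, key=value.get)
    reward = 1.0 if random.random() < reward_prob[action] else 0.0
    counts[action] += 1
    # Update the running-average estimate for the chosen action
    value[action] += (reward - value[action]) / counts[action]

print(max(value, key=value.get))  # the agent settles on "b"
```

No labeled dataset exists here; the feedback signal alone shapes the behavior.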
In cybersecurity, supervised and unsupervised learning, in their different variants, can detect threats and react appropriately to them. See the images below for more information on how each works.
[Images: Left: Supervised. Right: Unsupervised.]
Challenges of AI applications in cybersecurity
Implementing a threat detection model can be challenging. Take the example of a cybersecurity expert building a threat detection tool for website traffic, either in the supervised model, with classifiers trained on good and bad samples, or in the unsupervised model, by building a normality model from good samples. To build the model, you observe the traffic to the site to learn how users access its pages; you then use the model to determine whether a given access is an attack or not.
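The unsupervised variant described above can be sketched as follows: build a "normality model" from good traffic (here, simply the frequency of each requested path), then flag accesses the model has rarely or never seen. The paths and the rarity threshold are hypothetical.

```python
from collections import Counter

def build_normality_model(training_requests):
    """Learn the relative frequency of each path from known-good traffic."""
    counts = Counter(training_requests)
    total = len(training_requests)
    return {path: n / total for path, n in counts.items()}

def is_suspicious(model, path, min_freq=0.01):
    """Flag a request if its path was (almost) never seen in training."""
    return model.get(path, 0.0) < min_freq

# Hypothetical known-good traffic observed while building the model.
good_traffic = ["/home"] * 60 + ["/login"] * 25 + ["/search"] * 15
model = build_normality_model(good_traffic)
print(is_suspicious(model, "/login"))                   # False: seen often
print(is_suspicious(model, "/admin/../../etc/passwd"))  # True: never seen
```

The challenges that follow, unpredictable humans and adaptive attackers, are exactly what makes a naive frequency model like this one insufficient in practice.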
For sites that serve human users, remember that humans are unpredictable in how they interact with the site, even though a large percentage of the site traffic is likely from bots. (Further information on bots is available in the Imperva Incapsula Bot Traffic Report 2016.)
The problem becomes even bigger when it comes to malicious traffic.
The adversarial model – where do random bits come from?
Last in the list of challenges is the adversarial model. An adequate detection model is one with a low probability of false positives and a low probability of false negatives. Simple, right? Not necessarily.
In conventional machine learning settings, the answer is simple: the randomness stems from the world that generates the data samples you see, which under reasonable statistical assumptions behaves like the world that generated the data samples used to train the algorithm. In our cybersecurity example, this translates to the natural distribution of the benign and malicious users accessing the site. That may be reasonable when the attacker is a bot that repetitively launches the same set of attack vectors on the site, regardless of whether the attack succeeds or not, but a real human attacker is likely to adapt the attack to the site and its protection mechanisms.
— Itsik Mantin, director of security research, Imperva