Recently, while reading up on cyber security & threat detection, I came across Verizon's annual Data Breach Investigations Report. The report analyzes thousands of incidents reported by companies and by public & private organizations over the last couple of years, breaking them down by firmographics, geographies, industries etc. It finds that cyber intrusion is a growing threat to every industry in every country, and its recurring message is that “No single industry or organization in the world is safe from Cyber Threats”. This piqued my curiosity & I felt that data science could be used to tackle this problem effectively. I designed a Threat/Intrusion Detection System that could be used to detect such data leaks/breaches & take preventive action to contain, if not stop, the damage caused by a breach.

What is an Intrusion Detection System?

Wikipedia accurately defines it as “a device or software application that monitors a network or systems for malicious activity or policy violations”. The detected activities are reported either to an administrator for action or collected centrally using an event management system.
These systems use various methods to detect an intrusion or a breach. For example:

  1. Signature Based Detection – Detection of attacks by looking for specific patterns or sequences in network traffic, systems etc.
  2. Anomaly Detection – Detection of suspicious activity by comparing new activity against a baseline of historical network activity.
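
To make the difference between the two methods concrete, here is a minimal sketch in Python. The signature patterns, the z-score threshold and the traffic numbers are all assumptions made up for illustration; a production system would use curated signature feeds and a far richer baseline model.

```python
import re
import statistics

# Hypothetical signatures: regular expressions for known-bad patterns.
SIGNATURES = {
    "sql_injection": re.compile(r"union\s+select", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./\.\./"),
}

def match_signatures(payload):
    """Signature-based detection: return names of all patterns found."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]

def is_anomalous(history, new_value, threshold=3.0):
    """Anomaly detection: flag new_value if it lies more than `threshold`
    standard deviations away from the mean of the historical values."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) / stdev > threshold

# Made-up baseline: megabytes uploaded per hour by one host.
baseline_mb = [12, 15, 11, 14, 13, 12, 16, 14]

print(match_signatures("GET /items?id=1 UNION SELECT password FROM users"))
print(is_anomalous(baseline_mb, 13))
print(is_anomalous(baseline_mb, 250))
```

Signature matching only catches what is already known, while the anomaly check can flag something new but depends entirely on how representative the baseline is.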

Enablers of Intrusion:
There are multiple reasons/enablers of a data breach. They can be classified into 2 categories:

  • Internal Enablers:
    a) Compromised Actors: Users whose credentials or devices have been compromised or lost
    b) Negligent Actors: Users who expose data accidentally, for example by using insecure networks such as public Wi-Fi
    c) Malicious Insiders: Users who intentionally steal data
  • External Enablers:
    a) Hackers: People who hack an organization’s networks/devices to gain access to sensitive information
    b) Phishing: A social engineering attack intended to trick users into giving up their credentials

System Architecture

The entire system will consist of 6 subsystems/modules in all:

  1. Feature Engineering (which can be supported by frameworks like Kafka)
  2. Text Processing & Topic Modeling
  3. Internal Threat Detection System (Deep Learning based Engine which can be supported by Spark Streaming Framework)
  4. External Threat Detection system (Signature & Anomaly Detection Framework)
  5. Real Time Alert System
  6. Risk Scoring and Reporting

1.  Feature Engineering

An organization’s existing data can be leveraged to build features that describe behavior relevant to threat detection, for example:

  • Datasets like HR/Employee data to classify Internal vs External threats
  • Employees’ personal information to identify the intent & potential impact of a data breach.
  • Online Activities & browsing history to detect external threats.

Various hypotheses can then be formed to help detect breaches, for example:

  • Employees with access to a larger number of resources/data streams are more likely to be compromised
  • Employees with frequent records of unusual login times and places are a potential threat
  • Employees with poor peer reviews are more likely to have malicious intent
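
Hypotheses like these can be turned into per-employee features. The sketch below shows one way this might look, assuming a hypothetical access log of (employee, resource, timestamp, location) records and an assumed definition of office hours; none of these names or values come from a real dataset.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical access-log records: (employee_id, resource, ISO timestamp, location)
EVENTS = [
    ("e1", "crm", "2017-03-01T10:15:00", "NYC"),
    ("e1", "wiki", "2017-03-01T11:00:00", "NYC"),
    ("e2", "crm", "2017-03-01T02:30:00", "NYC"),
    ("e2", "payroll", "2017-03-02T03:10:00", "Kyiv"),
    ("e2", "source_code", "2017-03-02T14:00:00", "NYC"),
]

WORK_HOURS = range(8, 19)  # assumed office hours: 08:00 to 18:59

def build_features(events):
    """Aggregate raw events into one feature dict per employee."""
    resources = defaultdict(set)
    locations = defaultdict(set)
    off_hours = defaultdict(int)
    for emp, resource, ts, location in events:
        resources[emp].add(resource)
        locations[emp].add(location)
        if datetime.fromisoformat(ts).hour not in WORK_HOURS:
            off_hours[emp] += 1
    return {
        emp: {
            "n_resources": len(resources[emp]),      # breadth of access
            "n_locations": len(locations[emp]),      # distinct login places
            "n_off_hours_logins": off_hours[emp],    # unusual login times
        }
        for emp in resources
    }

features = build_features(EVENTS)
print(features["e2"])
```

Each feature maps back to one hypothesis, which makes it easier to check during exploratory analysis whether the hypothesis actually holds in the data.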

Feature Extraction is a very important step to successfully implement a deep learning system. A comprehensive list of hypotheses & a detailed exploratory data analysis are a must. A summary of features which I developed is shown below:

2.  Text Processing and Topic Modeling

Data from various sources can be fed into a text processing engine to identify whether the information in any mode of communication is confidential or not.

  1. Topic Identification to identify the topics of conversation occurring within the network.
  2. Attachment Identification to identify whether the attachments contain any sensitive information.
  3. Context Identification to determine whether the parties sharing confidential information have relevant authority & permissions and to identify if the data is related to their work or not.
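
The three steps above can be sketched roughly as follows. This is a deliberately simplified stand-in: it uses fixed keyword lists in place of a learned topic model, and the topic vocabularies, permission table and domain name are all hypothetical.

```python
import re

# Hypothetical topic vocabularies; a real system would learn topics from data.
TOPIC_KEYWORDS = {
    "finance": {"revenue", "forecast", "invoice", "salary"},
    "product": {"roadmap", "prototype", "launch", "patent"},
}
# Hypothetical permissions: topics each sender may share outside the company.
EXTERNAL_PERMISSIONS = {"cfo@corp.example": {"finance"}}
INTERNAL_DOMAIN = "corp.example"

def identify_topics(text):
    """Topic identification: return topics whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {topic for topic, vocab in TOPIC_KEYWORDS.items() if words & vocab}

def classify_message(sender, recipient, text):
    """Context identification: label a message 'ok' or 'flag' based on its
    topics and on its origin vs destination."""
    topics = identify_topics(text)
    if not topics or recipient.endswith("@" + INTERNAL_DOMAIN):
        return "ok"
    allowed = EXTERNAL_PERMISSIONS.get(sender, set())
    return "ok" if topics <= allowed else "flag"

print(classify_message("dev@corp.example", "friend@mail.example",
                       "Sharing the product roadmap and prototype"))
```

The same `identify_topics` function could be run over text extracted from attachments, which would cover the attachment identification step.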

Below is a working example of how the text processing & classification engine would identify various categories of information based on text processing & origin vs destination of the information.

3.  Deep Learning based Threat Detection System

Current methods & technologies are not efficient at detecting APTs (Advanced Persistent Threats: stealthy, long-running attacks that often rely on mutated viruses & malware). With ever-changing technology, we keep witnessing new ways of intruding into systems & breaching confidential data, which means the system should be self-learning. A multi-layered deep learning based system could not only help capture complex interactions between variables but could also be robust, scalable & adaptable. Every identified incident & pattern is assigned a risk score to help investigate the breach, control data loss and take precautionary actions for the future.
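
To illustrate what "capturing interactions between variables" means, here is a toy forward pass through a two-layer network. The weights are set by hand purely for illustration (a real system would learn them from data), and the two input features are hypothetical. The hidden unit only fires when both inputs are active, so the score is high for the combination but low for either signal alone.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hand-set weights for illustration only; a real system would learn them.
# The single hidden unit fires only when BOTH inputs are active.
HIDDEN_WEIGHTS = [1.0, 1.0]
HIDDEN_BIAS = -1.0
OUTPUT_WEIGHT = 6.0
OUTPUT_BIAS = -3.0

def network_risk_score(off_hours_login, large_transfer):
    """Forward pass of a tiny two-layer network; returns a score in (0, 1)."""
    inputs = [off_hours_login, large_transfer]
    hidden = relu(sum(w * x for w, x in zip(HIDDEN_WEIGHTS, inputs)) + HIDDEN_BIAS)
    return sigmoid(OUTPUT_WEIGHT * hidden + OUTPUT_BIAS)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(network_risk_score(a, b), 3))
```

A single linear layer cannot express this kind of "only when both happen" pattern, which is one reason a multi-layered model is attractive here.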

4.  Internal Threat Detection System

The system proposed above could potentially detect a breach within 20 seconds of the event.
Feature creation from the collected event data will happen in real time on a rolling basis. The data is then fed into the deep learning engine discussed above to detect threats. Various policies & pre-decided actions can then be undertaken by the system based on the severity of the detected breach.
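
Rolling feature creation could look something like the sketch below. The window length, the byte thresholds and the action names are assumptions, and a simple threshold check stands in for the deep learning engine so that the example stays self-contained.

```python
from collections import deque

class RollingWindow:
    """Keep events from the last `window_seconds` and expose simple features."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, bytes_sent), oldest first

    def add(self, timestamp, bytes_sent):
        self.events.append((timestamp, bytes_sent))
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def features(self):
        return {
            "event_count": len(self.events),
            "total_bytes": sum(b for _, b in self.events),
        }

def severity_action(features, byte_limit=1_000_000):
    """Map window features to a pre-decided action (thresholds are made up)."""
    if features["total_bytes"] > 10 * byte_limit:
        return "block"
    if features["total_bytes"] > byte_limit:
        return "alert"
    return "allow"

window = RollingWindow(window_seconds=60)
for ts, size in [(0, 500_000), (5, 400_000), (10, 300_000)]:
    window.add(ts, size)
print(window.features(), severity_action(window.features()))
```

In a streaming setup, one such window would typically be kept per user or per device, and its features passed to the model every time a new event arrives.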


5.  External Threat Detection System

The signatures & input nodes for external threat detection are based on files transferred, processes, network accesses, domains & IPs, logs, devices etc. A ruleset-based framework could help the system enforce pre-decided actions when a data breach is identified.
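
A ruleset of this kind might be sketched as an ordered list of rules, each pairing a condition on an event with a pre-decided action. The event fields, blocklists, thresholds and action names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]
    action: str  # pre-decided response, e.g. "block" or "alert"

# Hypothetical blocklists; real ones would come from threat intelligence feeds.
BLOCKED_DOMAINS = {"malware.example"}
BLOCKED_PROCESSES = {"dropper.exe"}

# Rules are evaluated in order, so the most severe ones come first.
RULES = [
    Rule("blocked_domain", lambda e: e.get("domain") in BLOCKED_DOMAINS, "block"),
    Rule("blocked_process", lambda e: e.get("process") in BLOCKED_PROCESSES, "quarantine"),
    Rule("large_upload", lambda e: e.get("bytes_out", 0) > 50_000_000, "alert"),
]

def evaluate(event):
    """Return (rule name, action) for the first matching rule, else (None, 'allow')."""
    for rule in RULES:
        if rule.matches(event):
            return rule.name, rule.action
    return None, "allow"

print(evaluate({"domain": "malware.example", "bytes_out": 10}))
```

Keeping the rules as data rather than hard-coded branches makes it easier to add or reorder them as new signatures are identified.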

6.  Risk Management

For continuous risk management, a risk-score-based framework could be developed to identify risky assets & employees and to help mitigate these risks in a timely manner. A risk score could be built from all of the attributes below to help identify the potential sources of a breach.
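
One simple way to combine attributes into a score is a weighted average, sketched below. The attribute names and weights here are illustrative assumptions, each attribute is assumed to be normalized to the range 0 to 1, and a real scoring scheme would need to be calibrated against past incidents.

```python
# Hypothetical attributes (each normalized to 0..1) and their weights.
WEIGHTS = {
    "data_access_breadth": 0.3,
    "off_hours_activity": 0.2,
    "policy_violations": 0.3,
    "model_threat_probability": 0.2,
}

def risk_score(attributes):
    """Weighted average of the attributes, scaled to 0..100.
    Missing attributes are treated as 0."""
    total = sum(WEIGHTS[name] * attributes.get(name, 0.0) for name in WEIGHTS)
    return round(100 * total / sum(WEIGHTS.values()), 1)

def rank_by_risk(entities):
    """Return (entity, score) pairs sorted from most to least risky."""
    scores = {name: risk_score(attrs) for name, attrs in entities.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_risk({
    "alice": {"off_hours_activity": 0.5},
    "bob": {"policy_violations": 1.0},
})
print(ranking)
```

Ranking employees and assets by such a score gives investigators a prioritized list instead of a flat stream of alerts.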




We have seen from breaches like the Yahoo data breach & the Sony email fiasco that cyber attacks are a real threat to the identity & reputation of an organization. It is imperative that adequate security measures be put in place to mitigate the risks they pose. Given new modes of communication, more sophisticated malware & new patterns of hacking, a framework like the one discussed above could be implemented to help prevent theft of digital information.


Aayush Agrawal

A data science enthusiast and an entrepreneur working on Marketing ROI Measurement using IOT & Big Data solutions.
