Web attacks detection using machine learning

  1. 8 weeks ago
    Edited 8 weeks ago by Men in Black


    1. Used Datasets

    The datasets used are available at http://www.secrepo.com/self.logs/ . For this experiment, I used the October 2016 to January 2016 access logs as training data and the February access logs for testing.

    2. Data preparation

    Classification is a machine learning method. Classifiers can be implemented using both supervised and unsupervised learning algorithms. In this article we implement supervised classifiers, which means they need to be trained with labeled data before they can make predictions. The training data therefore has to be labeled, and we have to choose the features used for prediction. Data preparation consists of extracting the wanted features from raw HTTP server log files and labeling each unit of data with one of two labels: 1 if it is considered an attack and 0 for normal behavior.
    a. Feature extraction
    The following features are chosen:

    • HTTP return code
    • URL length
    • Number of parameters in the query

    These features are extracted from the raw log file using the following function, which takes the raw log file name as input and returns a dictionary of features keyed by request line.

    import re

    # Retrieve data from an HTTP log file (access_log)
    def extract_data(log_file):
        regex = r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"'
        data = {}
        with open(log_file, 'r') as log_lines:
            for log_line in log_lines:
                match = re.match(regex, log_line)
                # Skip lines that do not follow the expected log format
                if match is None:
                    continue
                fields = match.groups()
                url = fields[2]
                return_code = int(fields[3])
                size = fields[4].rstrip('\n')
                # A size of '-' means that no response body was sent
                size = 0 if '-' in size else int(size)
                if return_code > 0:
                    data[url] = {
                        'size': size,
                        # Number of '&'-separated chunks in the request line
                        'param_number': len(url.split('&')),
                        'length': len(url),
                        'return_code': return_code,
                    }
        return data
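
    As a quick check of the parsing step, the sketch below applies the same regular expression to a single log line and derives the three chosen features. The sample line (IP address, timestamp, size, user agent) is made up for illustration, and the helper name parse_line is hypothetical; it is not part of the original code.

```python
import re

# Same pattern as in extract_data():
# IP, timestamp, request line, status code, size, referrer, user agent
LOG_REGEX = r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"'

def parse_line(log_line):
    """Return the features of one access-log line, or None if it does not match."""
    match = re.match(LOG_REGEX, log_line)
    if match is None:
        return None
    request = match.group(3)
    return {
        'length': len(request),
        'param_number': len(request.split('&')),
        'return_code': int(match.group(4)),
    }

# Hypothetical log line in combined log format
sample = ('192.0.2.1 - - [05/Nov/2016:10:00:00 -0700] '
          '"GET /self.logs/?C=D%3BO%3DA HTTP/1.1" 200 1234 "-" "Mozilla/5.0"')
features = parse_line(sample)
print(features)
```

    Note that param_number counts the '&'-separated chunks of the whole request line, so a request without any query string still yields 1.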

    b. Data labeling
    Labeling consists of assigning a label to each unit of data. The label indicates whether the log line is considered an attack or not. Labeling should normally be done manually by an experienced security engineer. In this example it is done automatically, using a function that looks for specific patterns in each URL and decides whether it is an attack. This automation is for experimental purposes only; manually labeled data will give much better results.
    The labeling function is the following:

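    def label_data(data, labeled_data):
        # Substrings that mark a request as an attack
        patterns = ['honeypot', '%3b', 'xss', 'sql', 'union', '%3c', '%3e', 'eval']
        for w in data:
            attack = '0'
            if any(pattern in w.lower() for pattern in patterns):
                attack = '1'
            # Write every row, attack or not: length,param_number,return_code,label,url
            data_row = ','.join([str(data[w]['length']), str(data[w]['param_number']),
                                 str(data[w]['return_code']), attack, w]) + '\n'
            labeled_data.write(data_row)
        # labeled_data is the open destination file, so its name is reported here
        print(str(len(data)) + ' rows have been successfully saved to ' + labeled_data.name)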
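
    To make the labeling rule concrete, here is a minimal sketch of the same substring test applied to single request lines. The helper name label_url is hypothetical, and the third request is an invented example chosen to show how a plain substring match can mislabel a harmless URL.

```python
# Same substrings as in label_data(); matching is done on the lowercased request
PATTERNS = ['honeypot', '%3b', 'xss', 'sql', 'union', '%3c', '%3e', 'eval']

def label_url(url):
    """Return '1' if the request contains a suspicious substring, else '0'."""
    lowered = url.lower()
    return '1' if any(pattern in lowered for pattern in PATTERNS) else '0'

# Contains %3B (an encoded semicolon), so it is labeled as an attack
print(label_url('GET /self.logs/?C=D%3BO%3DA HTTP/1.1'))
# No pattern present, labeled as normal
print(label_url('GET /self.logs/error.log.2016-11-05.gz HTTP/1.1'))
# Hypothetical harmless page whose name contains "sql": a likely false positive
print(label_url('GET /docs/mysql-backup.html HTTP/1.1'))
```

    The last case illustrates why manual labeling is preferable: the rule has no notion of context, only of substrings.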

    c. Ready data example
    The following is a sample of data ready to be used to train our classifiers. The legend below describes each column.

    36,1,200,1,GET /self.logs/?C=D%3BO%3DA HTTP/1.1
    47,1,200,0,GET /self.logs/error.log.2016-11-05.gz HTTP/1.1
    48,1,200,0,GET /self.logs/access.log.2015-12-19.gz HTTP/1.1
    38,1,404,0,GET /access.log.2015-03-03.gz HTTP/1.1
    47,1,200,1,GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.1
    47,1,200,1,GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.0
    48,1,200,0,GET /self.logs/access.log.2015-06-26.gz HTTP/1.1
    47,1,200,0,GET /self.logs/error.log.2016-03-09.gz HTTP/1.1

    LEGEND (columns, in order):

    URL LENGTH
    PARAMETERS NUMBER
    HTTP RETURN CODE
    LABEL (1: attack, 0: no attack)
    REQUEST LINE
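
    A row in this format can be split back into numeric features and a label as sketched below. This is only an illustration that follows the order in which label_data() writes the fields; it is not the repository's get_data_details(), and the name parse_row is hypothetical.

```python
def parse_row(row):
    """Split one labeled row into (features, label, request line)."""
    # Split on the first four commas only: the request line may itself contain commas
    length, param_number, return_code, label, request = row.strip().split(',', 4)
    return [int(length), int(param_number), int(return_code)], int(label), request

# Two rows taken from the sample above
rows = [
    '36,1,200,1,GET /self.logs/?C=D%3BO%3DA HTTP/1.1',
    '38,1,404,0,GET /access.log.2015-03-03.gz HTTP/1.1',
]

features, labels = [], []
for row in rows:
    row_features, row_label, _ = parse_row(row)
    features.append(row_features)
    labels.append(row_label)

print(features)  # [[36, 1, 200], [38, 1, 404]]
print(labels)    # [1, 0]
```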

    3. Decision Tree Classifier

    A decision tree is a classification algorithm. Like all supervised classifiers, it needs real-world training data to make predictions. The data prepared using the functions described in the previous sections is used as training data.
    Testing data is generated with the same functions.

    In this experiment I used the Decision Tree implementation available in scikit-learn (sklearn), a Python machine learning library. Here is the Decision Tree classifier script:

    import sys

    from sklearn import tree
    from utilities import *

    # Paths of the labeled training and testing files
    # (assumed to be passed on the command line, as in the commands of section 6)
    traning_data = sys.argv[1]
    testing_data = sys.argv[2]

    # Get training features and labels
    training_features, traning_labels = get_data_details(traning_data)

    # Get testing features and labels
    testing_features, testing_labels = get_data_details(testing_data)

    ### DECISION TREE CLASSIFIER
    print("\n\n=-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-\n")

    # Instantiate the classifier
    attack_classifier = tree.DecisionTreeClassifier()

    # Train the classifier
    attack_classifier = attack_classifier.fit(training_features, traning_labels)

    # Get predictions for the testing data
    predictions = attack_classifier.predict(testing_features)

    print("The precision of the Decision Tree Classifier is: " + str(get_occuracy(testing_labels, predictions, 1)) + "%")

    4. Logistic Regression Classifier

    Like the Decision Tree classifier, the Logistic Regression classifier is implemented in this experiment using scikit-learn (sklearn):

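    import sys

    from sklearn import linear_model
    from utilities import *

    # Paths of the labeled training and testing files
    # (assumed to be passed on the command line, as in the commands of section 6)
    traning_data = sys.argv[1]
    testing_data = sys.argv[2]

    # Get training features and labels
    training_features, traning_labels = get_data_details(traning_data)

    # Get testing features and labels
    testing_features, testing_labels = get_data_details(testing_data)

    ### LOGISTIC REGRESSION CLASSIFIER
    print("\n\n=-=-=-=-=-=-=- Logistic Regression Classifier -=-=-=-=-=-\n")

    # A large C means weak regularization
    attack_classifier = linear_model.LogisticRegression(C=1e5)
    attack_classifier.fit(training_features, traning_labels)

    # Get predictions for the testing data
    predictions = attack_classifier.predict(testing_features)
    print("The precision of the Logistic Regression Classifier is: " + str(get_occuracy(testing_labels, predictions, 1)) + "%")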

    5. Utilities

    The functions get_data_details() and get_occuracy(), used by both the Logistic Regression and the Decision Tree scripts, are implemented in a separate file: https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning/blob/master/utilities.py .

    6. Testing and comparing the precision

    root@enigmater:~/intrusion-detection-with-machine-learning$ python ./decision-tree-classifier.py ./labeled-data-samples/learning_data.csv ./labeled-data-samples/jan_2017_labeled_features.csv
    
     =-=-=-=-=-=-=- Decision Tree Classifier -=-=-=-=-=-=-=-
    
     Real number of attacks:43.0
     Predicted number of attacks:32.0
     The precision of the Decision Tree Classifier is: 74.41%
    
     root@enigmater:~/intrusion-detection-with-machine-learning$ python ./logistic-regression-classifier.py ./labeled-data-samples/learning_data.csv ./labeled-data-samples/jan_2017_labeled_features.csv
    
     =-=-=-=-=-=-=- Logistic Regression Classifier -=-=-=-=-=-
    
     Real number of attacks:43.0
     Predicted number of attacks:5.0
     The precision of the Logistic Regression Classifier is: 11.62 %
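
    When reading these figures, note that both percentages coincide with the ratio of predicted to real attack counts (32/43 and 5/43), truncated to two decimals. Whether get_occuracy() computes exactly this ratio is an assumption here, not something the output confirms. For comparison, the sketch below computes that ratio alongside the usual definitions of precision and recall on a toy label set; the function name precision_recall and the toy labels are made up for illustration.

```python
def precision_recall(true_labels, predicted_labels, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted_positives = sum(1 for _, p in pairs if p == positive)
    real_positives = sum(1 for t, _ in pairs if t == positive)
    # Precision: share of predicted attacks that really are attacks
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    # Recall: share of real attacks that were predicted
    recall = true_positives / real_positives if real_positives else 0.0
    return precision, recall

# Ratio of predicted to real attack counts, using the counts printed above
tree_ratio = 100 * 32 / 43
logistic_ratio = 100 * 5 / 43
print(round(tree_ratio, 2))      # 74.42 (reported, truncated: 74.41)
print(round(logistic_ratio, 2))  # 11.63 (reported, truncated: 11.62)

# Toy labels, made up for illustration
true_labels = [1, 1, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0]
precision, recall = precision_recall(true_labels, predicted)
print(precision)  # 0.5
print(round(recall, 3))  # 0.333
```

    A count ratio alone does not say how many of the predicted attacks were correct, so precision and recall computed per row are usually more informative when comparing the two classifiers.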

    7. Source code

    All the source code and some testing data are available at https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning .

 
