Identifying Botnet on IoT by Using Supervised Learning Techniques

The security challenge on IoT (Internet of Things) is one of the hottest and most pertinent topics at the moment especially the several security challenges. The Botnet is one of the security challenges that most impact for several purposes. The network of private computers infected by malicious software and controlled as a group without the knowledge of owners and each of them running one or more bots is called Botnets. Normally, it is used for sending spam, stealing data, and performing DDoS attacks. One of the techniques that been used for detecting the Botnet is the Supervised Learning method. This study will examine several Supervised Learning methods such as; Linear Regression, Logistic Regression, Decision Tree, Naive Bayes, kNearest Neighbors, Random Forest, Gradient Boosting Machines, and Support Vector Machine for identifying the Botnet in IoT with the aim of finding which Supervised Learning technique can achieve the highest accuracy and fastest detection as well as with minimizing the dependent variable. CONTACT Amirhossein Rezaei ahr338@gmail.com Advanced Informatics School, Level 5, Menara Razak, Universiti Teknologi Malaysia, Jalan Sultan Yahya Petra, 54100 Kuala Lumpur, Malaysia. © 2019 The Author(s). Published by Oriental Scientific Publishing Company This is an Open Access article licensed under a Creative Commons license: Attribution 4.0 International (CC-BY). Doi: http://dx.doi.org/10.13005/ojcst12.04.04 Article History Received: 16 August 2019 Accepted: 21 October 2019


Introduction
The Internet of Things (IoT) might be thought of as the designation of Ubiquitous Computing. In IoT, the user environment is replete of gadgets working cooperatively. 1, 2, 3 These gadgets go from computing elements, for example, RFID tags and biochip on ranch animals to smartphones and machines with implicit sensors. Types of gear like these are known by the name of "things". In any case, programming these frameworks is additionally challenging, due, not only to their sheer volume but also to their diversity. 4 Botnet attacks have destroyed the effect on public and private frameworks. The botmasters controlling these systems mean to counteract bring down endeavors by utilizing very versatile peer to peer overlays to lay hold of their botnets, and even solidify them with countermeasures against intelligence gathering endeavors. Ongoing research demonstrates that advanced countermeasures can hamper the capacity to accumulate the fundamental intelligence for bringing down botnets. 5 On the other hand, adapting to malware is getting increasingly challenging, given their constant development in intricacy and volume. A standout amongst the most widely recognized methodologies in writing is utilizing machine learning methods, to simply learn models and patterns behind such intricacy, and to create methods to keep pace with malware development. 6 Supervised Learning is one of the methods under Machine learning that it is on the application layer. Application layer and protocol layer are two layers under passive monitoring which is one of the methods under Network based on Anomaly-based that it is part of Intrusion Detection System classification (IDS). The Supervised Learning techniques also called the Classification methods is a type of technique that includes the target and dependent variable that is to be forecasted from a given set of independent variables. Therefore, it can produce the function that maps inputs to needed outputs using those sets of variables (training data). The training process will be continued until the model reaches the desired level of accuracy on training data. 7 According to our previous study, it reviewed some recent studies that have been done on a Machine learning algorithm to identify Botnet. On supervised learning methods, the statistical foundation like hypothesis representation, concerns about the relationship between the features "x" and target "y". That should be defined via the selected features as well as accepts some ample information regarding what that action is similar in the way to perfectly demonstrate the activities of bots using supervised learning methods. Detecting bots based on certain well-known and particular features have been used in supervised learning methods. The accuracy of supervised learning methods could be effective against bot traffic which seeks for covering up itself between legitimate traffic by given certain particular malicious traffic features. However, most of the supervised learning methods have a common trend also separately from particular perceptions toward traffic of bot shown in the feature space, supervised learning techniques accomplish poorly.
Supervised learning techniques might overcome the secret nature of bots. Supervised learning techniques are worked for cases that certain particular characteristic is known. 7 This research will study the several Supervised Learning methods which are; Linear Regression, Logistic Regression, Decision Tree, Naive Bayes, k-Nearest Neighbors, Random Forest, Gradient Boosting Machines, and Support Vector Machine for identifying the Botnet in IoT. This study aims to find which Supervised Learning technique can achieve the highest accuracy and fastest detection as well as with minimizing the dependent variable.

Data Sources & Instrumentation
To evaluate the classification performance of bots and botnet domain names using machine learning techniques, the extracted labeled domain name datasets that include the set of normal domain names and malicious domain names that have been used via Botnets will be used in this research. The set of normal domains name includes 790,745 domain names. Normal domain names will be checked at www.virustotal.com to verify that they are a normal domain. On the other hand, the set of malicious domain names (total number of 199,772) collected from three different sources which are: 1-www.cert. at 2-www.github.com and 3-www.kaggle.com. All domain names including the normal and malicious domain names have 204 variables such as the IP address, port number, SRC port, DST port, packet pay size, idle time max, HTTP response status code, packet header size, payload bytes max, HTTP request version, packets ack avg, packet direction, and so on. This research will firstly select five training datasets randomly which include both normal domain names and malicious domain names from data sources mentioned above. After that, one testing datasets which include both normal and malicious domain names from data sources but not from those which was selected for training datasets will be selected. In this study, a few classification measures such as False Positive (FP), True Positive (TP), False Negative (FN), True Negative (TN), and accuracy (ACC) will be used. After that will be used the suggested Supervised Learning algorithms on those training datasets and testing dataset to identify the best one in terms of accuracy and performance.

Implementation
After setting up the training dataset and testing dataset we have done analyzing individual feature statistics of variables and finding the correlation and apparent relationships between the variables to minimizing the number of variables that needed for identifying the Botnet without reducing the accuracy as well as avoiding increasing the duration of detection. We narrow it down from 204 variables to only 20 variables. Then we examine it on selected Supervised Learning methods to find out which one has the highest accuracy with the lowest time duration.

Linear Regression
One of the oldest and most widely used analytics methods is the technique of regression. The objective of regression is to deliver a model that represents the 'best fit' to some observed data. Commonly the model is a capacity depicting some kind of bend (lines, parabolas, and so on.) that is dictated by a lot of parameters (e.g., slope and intercept). "Best fit" implies that there is an ideal arrangement of parameters as indicated by an evaluation criterion we pick. A regression model endeavors to foresee the estimation of one variable, known as the dependent variable, response variable or label, utilizing the values of different variables, known as independent variables, explanatory variables or features. Single regression has one label used to foresee one feature. Multiple regression utilizes at least two feature variables. 8 Based on the apparent relationships identified when analyzing the data, a linear regression model was created to predict the value for the label (either they are normal domain names or malicious domain names), from which the predicted label can be calculated. The model was trained with 80% of the data and tested with the remaining 20%. The Root Mean Square Error (RMSE) for the test results is 0.3878and the standard deviation of the label is 0.1066. Also, to make all features standardization, we used the standard scaler to do standard normally distributed data. Since the model predicts log of the label, the results are is in float not in integer and due to the domain names are either normal domain (0) or malicious domain (1). To make it more accurate we round the predicted values to an integer. Indicating that there is a model performs reasonably well. The predicted label is converted back to its exponential value (the rounded label), which leads to fit well with The RMSE of 0.4378 and a standard deviation of 0.1270 and an accuracy of 0.8090 (80.90%). The True Positive (TP) 1.30%, True Negative (TN) 79.60%, False Positive (FP) 0.21 %, and False Negative (FN) 18.89% is after rounding the results which mean achievement of 80.90%. A scatter plot showing the predicted log label and the actual log label is shown in Figure 1.

Logistic Regression
Logistic Regression is a classification, not a regression algorithm. It is utilized to evaluate discrete values (Binary values like 0/1, yes/no, true/ false) dependent on a given arrangement of the independent variable(s). In straightforward words, it predicts the likelihood of an event of an occasion by fitting information to a logit function. Thus, it is otherwise called logit relapse. Since it predicts the likelihood, its yield esteems lies somewhere in the range of 0 and 1. 9 Based on the apparent relationships identified when analyzing the data, the Logistic Regression model was created to predict the value for the label (either they are normal domain names or malicious domain names), from which the predicted label can be calculated.  Figure 3.

Fig. 2: Logistic Regression histogram actual vs. predicted values Decision Tree
Decision Tree is a sort of supervised learning calculation that is for the most part utilized for classification issues. Incredibly, it works for both categorical and continuous dependent variables.
A decision tree is a choice help device that utilizes a tree-like model of choices and their conceivable results, including chance occasion results, asset expenses, and utility. It is one approach to show a calculation that just contains restrictive control explanations. Decision trees are generally utilized in operations research, explicitly in a choice examination, to help recognize a methodology well on the way to achieve an objective, but on the other hand, are a popular tool in machine learning. 10 Based on the apparent relationships identified when analyzing the data, a Decision Tree model was created to predict the value for the label (either they are normal domain names or malicious domain names), from which the predicted label can be calculated. The model was trained with 80% of the data and tested with the remaining 20%. The RMSE for the test results is 0.0477 and the standard deviation of the label is 0.4010. Also, to make all features standardization, we used the standard scaler to do standard normally distributed data.
Indicating that there is a model performs reasonably

Naive Bayes
Naive Bayes is a classification method dependent on Bayes' hypothesis with an assumption of independence between predictors. In basic terms, a Naive Bayes classifier expects that the nearness of a specific feature in a class is unrelated to the nearness of some other element. For instance, a fruit might be viewed as an apple in the event that it is red, round, and around 3 inches in distance across. Regardless of whether these highlights rely upon one another or upon the presence of alternate highlights, a naive Bayes classifier would consider these properties to independently add to the likelihood that this fruit product is an apple. Naive Bayes model is anything but difficult to construct and especially useful for very large data sets. Alongside effortlessness, Naive Bayes is known to outperform even highly sophisticated classification methods. 11 Based on the apparent relationships identified when analyzing the data, the Naive Bayes model was created to predict the log-normal value for the label (either they are normal domain names or malicious domain names), from which the predicted label can be calculated. showing the predicted log label and the actual log label is shown in Figure 5. it is all the more generally utilized classification problems in the business. K nearest neighbors is a straightforward algorithm that stores every available case and classifises new cases by a majority vote of its k neighbors. The case being doled out to the class is most common among its K nearest neighbors estimated by a distance function. k-NN calculation is a non-parametric technique utilized for classification and regression. In the two cases, the information comprises of the k nearest preparing models in the feature space. The output relies upon whether k-NN is utilized for classification or regression. In k-NN classification, the yield is class participation. An item is classified by a majority vote of its neighbors, with the object being given to the class most basic among its k closest neighbors (k is a positive integer, normally small). If k = 1, at that point the item is essentially allocated to the class of that single nearest neighbor. In k-NN regression, the output is the property estimation for the object. This value is the average of its k nearest

Random Forest
Random forest or random decision forest is a group learning strategy for classification, regression and different undertakings that works by building a large number of decision trees at training time and yielding the class that is the method of the classes (classification) or mean prediction (regression) of the individual trees. Random choice forests correct for decision trees' practice for overfitting to their training set. Random Forest is a trademarked term for a group of decision trees. In Random Forest, we have an accumulation of decision trees (so-known as "Forest"). To order another item dependent on characteristics, each tree gives a classification and we state the tree "votes" for that class. The forest picks the classification having the most votes (over every one of the trees in the forest). 13 Based on the apparent relationships identified when analyzing the data, a random forest model was created to predict the log-normal value for the label (either they are normal domain names or malicious domain names), from which the predicted label can be calculated. The model was trained with 80% of the data and tested with the remaining 20%. The RMSE for the test results is 0.0422 and the standard deviation of the label is 0.0. Also to make all features standardization, we used the standard scaler to do standard normally distributed data. Indicating that there is a model performs reasonably well. The result leads to fit 100% with an accuracy of 0.9982 (99.82%). The True Positive (TP) 20.09%, True Negative (TN) 79.73%, False Positive (FP) 0.04%, and False Negative (FN) 0.14% is after rounding the results which mean achievement of 99.82% or on the other hand 0.18% detecting wrongly. A scatter plot showing the predicted log label and the actual log label is shown in Figure 6.
learning calculations which consolidates the forecast of a few base estimators to improve power over a single estimator. It joins numerous feeble or normal indicators to build a strong predictor. 14 Based on the apparent relationships identified when analyzing the data, a Gradient Boosting Machines model was created to predict the log-normal value for the label (either they are normal domain names or malicious domain names), from which the predicted label can be calculated.  Figure 7. portrayal of the precedents as focuses in space, mapped with the goal that the instances of the different classes are separated by a reasonable hole that is as wide as could be expected under the circumstances. New precedents are then mapped into that equivalent space and anticipated to have a place with a class dependent on which side of the hole they fall. 15 Based on the apparent relationships identified when analyzing the data, the SVM model was created to predict the log-normal value for the label (either they are normal domain names or malicious domain names), from which the predicted label can be calculated. The model was trained with 80% of the data and tested with the remaining 20%. However, we run for one week but the processing was not done therefore we reduced the training data to 10% and also used Dimensionality Reduction Algorithm (PCA) to reduce the number of random variables under consideration by obtaining a set of principal variables. Nevertheless, after done all still could not get the results in a reasonable time. With the consideration from the above data collected from this research, even though the Random Forest has the highest accuracy rate (99.89%) but the duration taken from run the code to get the result was high (84 seconds and 72 milliseconds) while the Decision Tree has done in just 31 seconds and 10 milliseconds with only 0.06% lower accuracy than the Random Forest and since the speed of detecting is one of this research aim, the Decision Tree is the best to identifying the Botnets and bots in the Internet of Things with high accuracy (99.76%) on lowest time duration from run the code to get the result among of supervised learning methods which studied in this research.

Aknowledgement
As corresponding author I, Amirhossein Rezaei, hereby confirm on behalf of all authors that: 1. This manuscript, or a large part of it, has not been published, to any other journal.

2.
All text and graphics, except for those marked with sources, are original works of the authors, and all necessary permissions for publication were secured prior to submission of the manuscript.

3.
All authors each made a significant contribution to the research reported and have read and approved the submitted manuscript.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Conflict of Interest
We wish to draw the attention of the Editor to the following facts which may be considered as potential conflicts of interest and to significant financial contributions to this work.
[OR] We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.