Intelligence and Security Informatics Data Sets
AZSecure-data.org
Data Infrastructure Building Blocks for ISI. A Project of the University of Arizona (NSF #ACI-1443019), Drexel University, University of Virginia, University of Texas at Dallas, and University of Utah
ISI Research and Analysis Tool Inventory
Introduction
Security researchers may face steep learning curves when attempting to identify tools that can help them develop valuable security insights from data sets. This document provides a summary of tools that can aid researchers in performing data-driven security analytics. The tools presented are not an exhaustive catalog of the current data analytics landscape; rather, they reflect the tools used in past security informatics research at the University of Arizona's Artificial Intelligence Lab. We organize the tools into three major sections based on a traditional data analytics pipeline: (1) collection and storage tools; (2) pre-processing and analytics tools; and (3) visualization tools. For each category, we provide a short summary of the typical tasks completed in that phase of the data analytics process, followed by an inventory of tools in that category, giving each tool's name and a link to its download page and documentation. Note that researchers can choose among tools with similar functionality based on personal preference (e.g., WEKA vs. RapidMiner). We also highlight a set of ISI papers that have used the listed tools.
Collection and Storage Tools
The collection and storage of relevant data is the first stage in a typical data analytics exercise. Data collection aims to identify and capture relevant fields of data from a specific source (e.g., web forums, Twitter, etc.) and to index and store them in a database or some other format from which they can be retrieved for pre-processing and further analytics. This section details some of the packages and tools that can be used to collect and store data. At a high level, the collection process comprises three steps that move data from online sources into the database: extract, transform, and load (ETL).
Collection Process: Extract
The first part of the collection process involves extracting the data from the online source. Depending on the source system, different techniques can be used for extraction. Some sources may also have anti-crawling measures built in. We provide several techniques and strategies to counter some of these measures.
Spidering Tools
| Tool | Details |
| --- | --- |
| Offline Explorer | Offline Explorer (OE) Pro is a useful tool we use for collecting forum and other web content. OE provides a very useful GUI for creating and scheduling crawling projects, built-in support for completing HTML login forms, and support for routing traffic through proxy servers and the Tor network. |
| cURL | cURL is a tool to transfer data from or to a server using one of its supported protocols. It offers a wide range of useful features, including proxy support, user authentication, FTP upload, HTTP POST, SSL connections, cookies, file transfer resume, and Metalink support. Link: https://curl.haxx.se/ |
| Wget | Wget is a free utility for non-interactive download of files from the Web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Its features include recursive download and conversion of links for offline viewing of local HTML. |
Packages for Customized Spiders
| Package | Programming Language | Details |
| --- | --- | --- |
| HtmlUnit | Java | HtmlUnit is a headless web browser written in Java. It allows high-level manipulation of websites from Java code, including filling and submitting forms and clicking hyperlinks, and provides access to the structure and details of received web pages. HtmlUnit emulates parts of browser behavior, including the lower-level aspects of TCP/IP and HTTP, and can handle HTTPS security, basic HTTP authentication, automatic page redirection, and other HTTP headers. |
| Selenium | Python | Selenium is a browser automation library that can be used for any task requiring automated interaction with the browser. Selenium makes direct calls to the browser using each browser's native support for automation. |
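To illustrate, the following minimal Python sketch uses Selenium to automate a login and capture rendered HTML. The forum URL, form field names, and credentials are hypothetical, and the locator calls assume a Selenium 3-style API; an actual crawler would adapt these to the target site.

```python
# Minimal Selenium sketch: log into a hypothetical forum and save a page.
from selenium import webdriver

driver = webdriver.Firefox()  # requires a local Firefox/geckodriver install
driver.get("https://forum.example.com/login")

# Fill and submit the login form (field names are hypothetical).
driver.find_element_by_name("username").send_keys("researcher")
driver.find_element_by_name("password").send_keys("secret")
driver.find_element_by_name("submit").click()

# Retrieve the rendered HTML of a thread listing for later parsing.
driver.get("https://forum.example.com/threads?page=1")
with open("threads_page1.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)

driver.quit()
```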
Counter Anti-crawling Techniques
| Anti-crawling Measure | Description | Counter-measure |
| --- | --- | --- |
| User-agent Check | Shops verify that the HTTP request comes from a legitimate user agent (browser). | Use packages that mimic the behavior of mainstream browsers. |
| User/password Authentication | Shops require users to register and log in before accessing the data. CAPTCHAs are widely used to verify that the user entering the credentials is a human. | Log in to the shop first and extract the corresponding cookies. By sending these cookies with each HTTP request, the crawler can bypass the login process. |
| Session Timeout | Shops automatically log out users who have been in the shop for too long. | Human involvement is needed to acquire and deploy renewed cookies. |
| IP Check | CloudFlare verifies that the HTTP request comes from a legitimate IP address rather than a publicly known proxy, such as Tor. | Set up a private, dedicated proxy server to reroute connections. The proxy server can be deployed on DigitalOcean, where a new IP address can easily be obtained after the first is banned. |
| DDoS Prevention | CloudFlare detects possible DDoS signs and bans the suspicious IP address. | Set intervals between successive requests; allow the private proxy server to change IP addresses easily. |
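The sketch below shows how several of these counter-measures might be combined in a single Python crawler using the requests library. The shop URL, cookie value, and proxy address are placeholders, and the pacing interval is only an example.

```python
# Combining counter-measures in one crawler session (all targets hypothetical).
import time
import requests

session = requests.Session()

# Counter the user-agent check: mimic a mainstream browser.
session.headers["User-Agent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                                 "Chrome/55.0 Safari/537.36")

# Counter user/password authentication: reuse cookies captured after a
# manual login (CAPTCHAs generally require a human in the loop).
session.cookies.update({"sessionid": "PASTE_COOKIE_VALUE_HERE"})

# Counter the IP check: route traffic through a private, dedicated proxy.
session.proxies = {"https": "http://my-private-proxy.example.com:8080"}

# Counter DDoS prevention: pace requests with an interval between them.
for page in range(1, 6):
    response = session.get("https://shop.example.com/listings?page=%d" % page)
    with open("listings_%d.html" % page, "w", encoding="utf-8") as f:
        f.write(response.text)
    time.sleep(10)  # polite delay between successive requests
```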
Collection Process: Transform
The second part of the collection process involves transforming the raw data into target data elements. These tools help parse the target data elements from the raw collected data, especially web pages.
| Tool | Details |
| --- | --- |
| Regex | A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. This pattern is typically used by string-searching algorithms for "find" or "find and replace" operations on strings. |
| JSoup | JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Link: https://jsoup.org/ |
| BeautifulSoup | Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup (i.e., non-closed tags; it is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. |
| urllib | The urllib module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Uniform Resource Locators (URLs) instead of filenames. Some restrictions apply: it can only open URLs for reading, and no seek operations are available. |
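As a brief example, the following Python sketch uses BeautifulSoup together with a regular expression to transform a raw collected page into structured records. The HTML structure (post containers, author and body classes) is hypothetical and would vary by source.

```python
# Parse author, body text, and embedded dates out of a saved forum page.
import re
from bs4 import BeautifulSoup

with open("threads_page1.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

records = []
for post in soup.find_all("div", class_="post"):
    author = post.find("span", class_="author").get_text(strip=True)
    body = post.find("div", class_="body").get_text(" ", strip=True)
    # Use a regex to pull dates in MM/DD/YYYY form out of the post body.
    dates = re.findall(r"\d{2}/\d{2}/\d{4}", body)
    records.append({"author": author, "body": body, "dates": dates})

print(len(records), "posts parsed")
```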
Collection Process: Load
The last part of the collection process involves loading the data into the data warehouse. Below is a list of common data warehouse implementations and their associated documentation.
| Implementation | Details |
| --- | --- |
| MySQL | MySQL is an open-source relational database management system (RDBMS). Link: https://www.mysql.com/ |
| MS SQL Server | Microsoft SQL Server is a relational database management system developed by Microsoft. Link: https://www.microsoft.com/en-us/sql-server/sql-server-2016 |
| Oracle Database | Oracle Database is an object-relational database management system produced and marketed by Oracle Corporation. |
| Apache HBase | Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. It provides Bigtable-like capabilities on top of Hadoop and is suited for random, real-time read/write access to Big Data. |
| Apache Hive | Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Link: https://hive.apache.org/ |
| MongoDB | MongoDB (from "humongous") is a free, open-source, cross-platform, document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemas. Link: https://www.mongodb.com/ |
| Apache Lucene | Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform. |
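Continuing the transform sketch above, the following Python snippet loads the parsed records into MySQL using the mysql-connector-python driver. The database name, table schema, and credentials are hypothetical.

```python
# Load parsed forum records into a hypothetical "forum" MySQL database.
import mysql.connector  # pip install mysql-connector-python

cnx = mysql.connector.connect(host="localhost", user="researcher",
                              password="secret", database="forum")
cursor = cnx.cursor()

cursor.execute("""CREATE TABLE IF NOT EXISTS posts (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    author VARCHAR(255),
                    body TEXT)""")

# "records" is the list of dicts produced in the transform step above.
for record in records:
    cursor.execute("INSERT INTO posts (author, body) VALUES (%s, %s)",
                   (record["author"], record["body"]))

cnx.commit()
cursor.close()
cnx.close()
```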
Pre-Processing and Analytics Tools
Before data can be analyzed, it often has to be pre-processed and transformed into a format that is conducive to analysis. This process often consumes the majority (70-75%) of the time in data analytics projects. Pre-processing tasks include, but are not limited to, cleaning, normalizing, transforming, tokenizing, extracting features, and tagging parts of speech. While custom scripts are often required for pre-processing, there are some general-purpose tools that can help convert data into usable formats for analytics.
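As a small illustration, the following Python sketch uses NLTK to tokenize, clean, and part-of-speech tag a single made-up forum post; a real pipeline would apply the same steps corpus-wide.

```python
# Minimal NLTK pre-processing sketch: tokenize, normalize, and POS-tag.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

text = "New keylogger released, check the attached source code!"
tokens = nltk.word_tokenize(text)

# Lowercase, keep alphabetic tokens, and drop English stopwords.
stops = set(stopwords.words("english"))
cleaned = [t.lower() for t in tokens if t.isalpha() and t.lower() not in stops]

print(cleaned)               # feature-ready tokens
print(nltk.pos_tag(tokens))  # part-of-speech tags
```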
Once data has been pre-processed and converted into a format appropriate for analysis, the third phase of the data analytics pipeline focuses on analyzing the data to derive useful and interesting insights. Past security analytics research has employed dozens of analytical techniques, ranging from simple summary statistics to complex algorithms such as deep learning, so a large range of tools can be applied for security analytics. Many common data mining algorithms (e.g., SVM, Naive Bayes, k-means, regression) and general text mining applications (named entity recognition, coreference resolution, etc.) are bundled into single packages such as WEKA or the Natural Language Toolkit. However, various analytical approaches (e.g., hidden Markov models, conditional random fields, social network analysis) are not currently part of any general toolset but are instead available in more specialized packages; those tools are also listed.
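For instance, a common analytics step is supervised text classification. The minimal scikit-learn sketch below trains a Naive Bayes classifier on TF-IDF features; the example posts and labels are fabricated for illustration only.

```python
# Minimal scikit-learn sketch: TF-IDF features + Naive Bayes classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = ["selling fresh dumps and cvv",
         "free keylogger source code download",
         "how do I configure my router firewall",
         "best antivirus for home use"]
labels = ["malicious", "malicious", "benign", "benign"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(posts, labels)

print(model.predict(["keylogger download link"]))  # -> ['malicious']
```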
| Tool Type | Tool Name | Programming Language | Notes |
| --- | --- | --- | --- |
| General Data Mining | WEKA | Java, GUI | One-stop tools that cover popular pre-processing, classification, and clustering algorithms. RapidMiner and WEKA can be used independently without a specific programming language. |
| | Scikit-Learn | Python | |
| | RapidMiner | GUI | |
| | R | R | A widely used programming language and software environment for statistical computing and graphics. Various data pre-processing and analytics tools are supported by packages. Link: https://www.r-project.org/ |
| General Text Mining | Natural Language Toolkit (NLTK) | Python | One-stop tools that cover word/sentence tokenization, POS tagging, parsing, chunking, named entity recognition, etc. NLTK has interfaces to call the Stanford NLP tools. |
| | Stanford CoreNLP | Java | |
| | Apache OpenNLP | Java | |
| Sentiment Analysis | SentiStrength | Java | Estimates the strength of positive and negative sentiment in short texts. |
| Ontologies | WordNet | - | English lexical database with words grouped into sets of synonyms. |
| | SentiWordNet | - | WordNet tagged with positivity, negativity, and neutrality scores for opinion mining. |
| Hidden Markov Models (HMM) | hmmlearn | Python | General HMM package. |
| | NLTK | Python | Specialized in POS tagging. |
| Conditional Random Fields (CRF) | Stanford NER | Java | The Stanford NLP Group's NER tool includes a CRF implementation. |
| | CRF++ | C++ | General CRF package. |
| | NLTK | Python | Specialized in POS tagging; relies on the pycrfsuite package. |
| Latent Dirichlet Allocation (LDA) | Mallet | Java | Command-line tool that can perform standard LDA. |
| | Stanford Topic Modelling Toolbox | GUI | GUI-based tool that supports LDA, labeled LDA, and partially labeled LDA, and can calculate perplexity. Can also perform temporal LDA. |
| | Gensim | Python | Allows users to perform Latent Semantic Analysis and LDA in Python. Useful when integrating LDA with other Python applications. |
| Social Network Analysis | UCINET | GUI | Licensed software (minimum $40) that can handle medium-sized networks (2 million nodes max). |
| | Gephi | GUI | Open-source GUI-based software that can handle larger data sizes than UCINET. Can read directly from databases. |
| | NetworkX | Python | Python-based network analysis tools. Can read from a variety of data sources and allows for more customization than the other tools. |
| Word2vec | Gensim | Python, C | Word2vec is a two-layer neural net that processes text: its input is a text corpus and its output is a set of feature vectors for the words in that corpus. While Word2vec is not itself a deep neural network, it turns text into a numerical form that deep nets can understand. |
| | DL4J | Java, Scala | |
| Deep Learning | Keras | Python | High-level neural networks library running on top of either TensorFlow or Theano. Recommended for fast experimentation. |
| | TensorFlow | Python, C++ | Low-level implementation for deep learning models. |
| | Theano | Python | Low-level implementation for deep learning models. |
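As one concrete example from the table above, the following sketch performs standard LDA with Gensim. The four toy documents are invented solely to make the snippet self-contained.

```python
# Minimal Gensim LDA sketch on a tiny, made-up forum corpus.
from gensim import corpora, models

documents = [["keylogger", "source", "code", "free"],
             ["credit", "card", "dumps", "sale"],
             ["botnet", "rental", "ddos", "service"],
             ["keylogger", "stealer", "crypter", "sale"]]

# Build the vocabulary and convert each document to bag-of-words counts.
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=42, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```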
Visualization Tools
The final stage in the data analytics pipeline often incorporates a visualization component, where researchers use various tools to create diagrams. Desktop software provides turnkey solutions to manage, connect, and pivot data and to render predefined visualization types in a GUI. For better customizability, lightweight toolkits, packages, and online services can be used alongside analytical scripts.
Desktop Visualization Software
| Tool | Cost | Description |
| --- | --- | --- |
| Microsoft Excel | License required | Excel supports charts and graphs generated from specified groups of cells. Excel 2010 and later support PivotTables, which enable geo-map plotting as well as interactive visualizations. |
| Tableau | Free education license | Tableau queries relational databases, cubes, cloud databases, and spreadsheets and then generates a number of graph types that can be combined into dashboards and shared over a computer network or the Internet. |
| ParaView | Free, open source | Users can quickly build visualizations to analyze their data using qualitative and quantitative techniques. Data exploration can be done interactively in 3D or programmatically using ParaView's batch-processing capabilities. ParaView was developed to analyze extremely large datasets using distributed-memory computing resources. |
Lightweight Toolkits, Packages, and Online Services
| Tool Type | Description | Tools (Programming Language) |
| --- | --- | --- |
| General Data Visualization Toolkits | General data visualization toolkits enable users to customize visualization components (e.g., points, lines, axes, legends, data layout, color coding) programmatically. Matplotlib, Seaborn, pandas, and ggplot2 provide basic visualization templates (e.g., scatterplot, bar chart) for fast implementation. | Visualization Toolkit (VTK) (C++, Python, Java); OpenFrameworks (C++); Processing (Java, Python, JavaScript); Matplotlib (Python); Seaborn (Python); pandas (Python); ggplot2 (R) |
| Word Cloud | A word cloud is a graphical representation of word frequencies. It can be used to visualize the most frequently used keywords in a corpus. | Wordle (online, JavaScript) |
| Geo-Map Tools | When location data (e.g., state, zip code, latitude and longitude) is available, these tools can help lay the data out on a map and generate visualizations such as color maps and flow maps. | Mapbox (online, JavaScript); geoplotlib (Python); choroplethr (R) |
| Network Visualization Tools | Network visualization tools can visualize the relationships between data attributes or different data sources. Built-in layout algorithms automatically generate visually pleasing graphs. | Gephi (GUI, Java); networkx (Python); graph-tool (Python); igraph (R) |
| Front-end Visualization Tools | These tools provide solutions for embedding static or interactive visualizations in web pages. Predefined templates make them lightweight design tools compared with general visualization toolkits. | D3.js (JavaScript); Google Chart (JavaScript); googleVis (R; link: https://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html); Datawrapper (online, JavaScript); Infogram (online, JavaScript); Plotly (online, JavaScript, R, Python) |
| Interactive Visualization Tools | Interactive visualization tools support user interactions such as highlighting, zooming, and panning. Interactive visualization is a good way to present data at different granularities of detail or with time-series changes. | Bokeh (Python; link: http://bokeh.pydata.org/en/latest/docs/user_guide.html#userguide); ggvis (R); visNetwork (R) |
| Color Selection (Aesthetic) | These color selection tools help improve the aesthetics of a visualization. They also provide color choices that are safe for web presentation, printing, and color-blind viewers. | Color Brewer 2 (online); Palettable (Python); RColorBrewer (R; link: https://cran.r-project.org/web/packages/RColorBrewer/index.html) |
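To illustrate one of the network visualization tools above, the following Python sketch draws a small, entirely hypothetical forum reply network with networkx (which renders through Matplotlib), sizing nodes by degree centrality to highlight key members.

```python
# Minimal network-visualization sketch with networkx and Matplotlib.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_edges_from([("hacker_a", "hacker_b"), ("hacker_a", "hacker_c"),
                  ("hacker_b", "hacker_d"), ("hacker_c", "hacker_d"),
                  ("hacker_d", "hacker_e")])

# Size nodes by degree centrality so well-connected members stand out.
centrality = nx.degree_centrality(G)
sizes = [3000 * centrality[n] for n in G.nodes()]

nx.draw(G, with_labels=True, node_size=sizes, node_color="lightblue")
plt.savefig("reply_network.png")
```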
Example ISI Papers
To show the research context of applying the listed tools, we reviewed research papers from the 2016 and 2015 IEEE ISI conferences (56 and 47 papers, respectively), 2016 FOSINT-SI (8 papers), and 2015 ISI-ICDM (10 papers). Following the structure of this document, tools are categorized into collection and storage, pre-processing and analytics, and visualization tools. We selected representative papers to show how these tools can be used together to support research. Note that around 70 percent of the papers we reviewed did not specify the tools they used, especially for storage and visualization, or mentioned only the techniques rather than the tools used for implementation.
| Paper | Collection and Storage | Pre-Processing and Analytics | Visualization |
| --- | --- | --- | --- |
| Samtani et al. (2016) | Offline Explorer, MySQL, Regex | RapidMiner, Stanford Topic Modelling Toolbox | Tableau, D3.js |
| Grisham et al. (2016) | Selenium, MySQL | Stanford Topic Modelling Toolbox | - |
| Benjamin & Chen (2016) | Offline Explorer, MySQL, Regex | Word2vec | - |
| Benjamin & Chen (2014) | IRC Bots | WEKA | - |
| Samtani & Chen (2016) | Offline Explorer, MySQL, Regex | Gephi | Gephi |
| Solaimani et al. (2016) | MongoDB | CoreNLP, WordNet | - |
| Dobolyi & Abbasi (2016) | PhishTank API, Wget | R | R |
| Park et al. (2016) | SQLite | Apache OpenNLP, SentiStrength | - |
References
1. Samtani, S., & Chen, H. (2016, September). Using social network analysis to identify key hackers for keylogging tools in hacker forums. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on (pp. 319-321). IEEE.
2. Grisham, J., Barreras, C., Afarin, C., Patton, M., & Chen, H. (2016, September). Identifying top listers in Alphabay using Latent Dirichlet Allocation. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on (pp. 219-219). IEEE.
3. Samtani, S., Chinn, K., Larson, C., & Chen, H. (2016, September). AZSecure Hacker Assets Portal: Cyber threat intelligence and malware analysis. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on (pp. 19-24). IEEE.
4. Benjamin, V., & Chen, H. (2016, September). Identifying language groups within multilingual cybercriminal forums. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on (pp. 205-207). IEEE.
5. Dobolyi, D. G., & Abbasi, A. (2016, September). PhishMonger: A free and open source public archive of real-world phishing websites. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on (pp. 31-36). IEEE.
6. Solaimani, M., Salam, S., Mustafa, A. M., Khan, L., Brandt, P. T., & Thuraisingham, B. (2016, September). Near real-time atrocity event coding. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on (pp. 139-144). IEEE.
7. Park, A. J., Beck, B., Fletche, D., Lam, P., & Tsang, H. H. (2016, August). Temporal analysis of radical dark web forum users. In Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on (pp. 880-883). IEEE.
MalwareDetect source code for "Malware Detection Framework Using Static Analysis Approach" is available on GitHub: https://github.com/helloram52/detectmalware