
OSINT supported by a Big Data ecosystem can take investigations to a new level. In this article, I will go through a couple of Big Data areas and technologies to show their benefits for OSINT.
Big Data areas
In short, Big Data is an approach to the problem of huge and ever-increasing volumes of data of different kinds. For me, the most fascinating aspect of this topic is the value you can get thanks to:
- the fusion of various data sources;
- the ability to process and retrieve useful information from unstructured data (e.g. sound, images, video);
- patterns that can be discovered only in huge volumes of data;
- the ability to manage historical data.
Of course, there are many other aspects, but in my opinion, the elements presented above form the core of Big Data.
Big Data is a large bag containing various technologies, architectures, methodologies, and more. This article is just the tip of the iceberg, but I hope it gets you interested in the topic.
Should I use Big Data?
As always – I would say that it depends. I hope you will form your own opinion after reading this article, but there are obvious cases where using Big Data doesn’t make sense. If you are just starting with OSINT, only do it from time to time, or cannot keep the history (or at least part of it) of your investigations for some reason, then using Big Data is probably not the best idea.
You also have to remember that, in most cases, setting up a Big Data ecosystem is not easy. But if you run lots of investigations, work in a team, and feel overwhelmed by the volume or variety of your data, then it’s worth checking whether Big Data can come in handy.
Recommended technologies
One note before we dive into examples. The core of Big Data comes from Open Source technologies, with one visible leader – the Apache Software Foundation, which acts as a kind of umbrella for Open Source software. There are, of course, proprietary products in this area, but the rise of Big Data has its origin in Open Source. Most of these projects provide paid support for companies that require availability guarantees.
Over the past few years, the Big Data market has started to change. The three leading cloud providers (AWS, GCP, Azure) have started to offer managed Big Data services, so it’s usually much easier to get started – especially since, in many cases, such services are based on Open Source technologies. We can now observe cloud hype in most companies. Does it always make sense? I don’t think so, but that’s a topic for another article. In this post, I will focus on Open Source technologies – but from time to time, I will also mention cloud alternatives.
OSINT cycle
There are many different approaches to OSINT investigations, but in general, a common cycle consists of the following phases:
- Planning and direction
- Collection/Gathering
- Processing and Analysis
- Dissemination
- Feedback
Let’s go through all of these steps and find out how and where Big Data can be of help. We will start from the second phase and finish with the first one.
Collection
This phase concerns three Big Data layers: Storage, Data Collection and Data Ingestion.
Storage
One of the best (in my opinion) assumptions in the Big Data world is that system failure is something that just happens. It’s nothing exceptional, so we need to build systems that can simply handle it. The first and most popular Big Data storage technology was built with this principle in mind: the Hadoop Distributed File System (HDFS), the core part of Apache Hadoop. From the user’s perspective, HDFS is similar to a local file system – you can operate on it much like you would on Linux, and even basic commands such as cp, mkdir, rm, and cat are very similar in terms of syntax. The magic is behind:
- Distribution over many nodes (machines) – when you put data into HDFS, it is split into blocks and persisted on many machines to avoid overloading a single node.
- Replication – every block is stored in multiple copies (three by default) on different machines. If something happens to one node, HDFS automatically re-creates the data that was stored on the failed machine, so the configured number of replicas is always available.
Thanks to such mechanisms – and a few others – the data you push can be accessed from different places/machines, so if you save the files from your investigations with appropriate permissions, your whole team can access them. What is also great about HDFS is that you don’t need very expensive hardware – you can install it on the same kind of disks you use every day.
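To give you a feel for how simple the interaction can be, here is a minimal Python sketch using the third-party hdfs package (a WebHDFS client) – the NameNode address, user name, and paths are just my assumptions for the example:

```python
# Minimal sketch, assuming a WebHDFS endpoint on the default Hadoop 3 port.
# Requires: pip install hdfs
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="analyst")  # hypothetical host/user

# Create a directory for the current investigation.
client.makedirs("/osint/case-2021-001/images")

# Upload a local evidence file; HDFS splits and replicates it automatically.
client.upload("/osint/case-2021-001/images/photo.jpg", "photo.jpg")

# List what the team has collected so far.
print(client.list("/osint/case-2021-001/images"))
```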
HDFS is relatively good, but nowadays we have lots of other options for distributed file/object storage. For the sake of simplicity, I will assume that file stores and object stores are the same thing – from your perspective, each is a kind of store for your files. The Apache Foundation now has an interesting Open Source alternative to HDFS – Apache Ozone. If you like the Cloud, you can choose Amazon S3, Google Cloud Storage, Azure Blob Storage, or Azure Data Lake Storage Gen2. These are just examples – the sky is the limit.
Data Collection and Ingestion
Gathering data from various sources automatically is not such a simple task. In OSINT, it is sometimes just not possible, or not worth the time and money, to automate every task related to gathering data. However, even if you do something manually, you can still store the results in one safe place. You can also build a mechanism that picks up your files from one location, performs some basic processing, and puts them in the right place.
In 2006, the National Security Agency created the NiagaraFiles system and, after a couple of years, published it as an Open Source technology called NiFi (under the Apache Foundation). It lets you build advanced data flows using a web-based graphical user interface and many built-in components. Within a short amount of time, and without any programming, you can build flows that collect data from simple web pages and internal databases and put it into some place (another database, the local file system, HDFS, etc.). Check out my article on building a Pastebin monitoring tool with NiFi. If you know Python, you can build a more advanced crawling tool that gathers data from complex pages and integrate it with NiFi – a rough sketch follows below. It’s definitely worth checking out!
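As an illustration, here is a hypothetical fetcher in Python that drops its results as JSON files into a directory watched by NiFi’s GetFile processor – the URL and paths are made up for the example:

```python
# Hypothetical sketch: fetch a page and hand it off to NiFi via the file system.
# Requires: pip install requests
import json
import time
from pathlib import Path

import requests

WATCH_DIR = Path("/data/nifi/incoming")  # assumed directory watched by NiFi's GetFile
TARGET = "https://example.com/some-page"  # placeholder URL

resp = requests.get(TARGET, timeout=30)
record = {
    "url": TARGET,
    "fetched_at": int(time.time()),
    "status": resp.status_code,
    "body": resp.text,
}

# Once the file lands in WATCH_DIR, the NiFi flow takes over (parsing,
# routing, pushing to HDFS or a database).
out = WATCH_DIR / f"page-{record['fetched_at']}.json"
out.write_text(json.dumps(record), encoding="utf-8")
```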
Processing and Analysis
This part can leverage the following Big Data layers:
- Processing (Batch and Stream);
- Online Querying;
- Advanced Analytics (including AI/ML);
- Visualizations.
Processing
Let’s assume we have some data collected during the previous phase. It could be anything: leaked credentials, images, tweets, etc. We’ve decided to use a distributed file system (like HDFS), and now we need to process the data to obtain some insights or conclusions. For simple data transformations (such as retrieving information from JSON/CSV files), NiFi may come in handy, but I wouldn’t recommend using it for more complex processing. For such a job, you have one of the most popular Big Data processing frameworks – Apache Spark. It requires programming skills (Python, Scala, Java, or R), but with this technology, you can build heavy processing pipelines over large volumes of data. All the biggest cloud providers have a dedicated managed service for Spark, so you can just try it without figuring out how to configure it. If you are not a programmer but want to leverage Spark, try to find Open Source applications based on it. One example could be our tool for metadata (EXIF) extraction and analysis – Metadata Digger.
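To show what such a pipeline might look like, here is a minimal PySpark sketch that counts e-mail domains in a hypothetical dump of leaked credentials stored on HDFS as JSON lines – the path and the email field are assumptions:

```python
# Minimal PySpark sketch: which domains appear most often in a leak?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("leak-analysis").getOrCreate()

# Assumed layout: JSON lines with an "email" field, stored on HDFS.
leaks = spark.read.json("hdfs:///osint/leaks/*.json")

domains = (
    leaks
    # Take everything after the "@" and normalize the case.
    .withColumn("domain", F.lower(F.split(F.col("email"), "@").getItem(1)))
    .groupBy("domain")
    .count()
    .orderBy(F.desc("count"))
)

domains.show(20, truncate=False)
```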
If you know SQL and want to analyze data with it, you can also use Spark, but there are other interesting technologies that provide a SQL interface and do the heavy computation underneath (e.g. Apache Hive, PrestoDB, etc.).
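Continuing the sketch above, the same aggregation expressed in SQL could look like this (a Hive or Presto query over the same data would be almost identical):

```python
# Register the DataFrame from the previous sketch as a temporary view
# and run the same aggregation in Spark SQL.
leaks.createOrReplaceTempView("leaks")

spark.sql("""
    SELECT lower(split(email, '@')[1]) AS domain,
           count(*)                    AS hits
    FROM leaks
    GROUP BY lower(split(email, '@')[1])
    ORDER BY hits DESC
    LIMIT 20
""").show(truncate=False)
```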
You can think of building a monitoring tool that notifies you when certain information appears in a specific part of the Internet. It could be an image with certain objects (weapons, people, military buildings, etc.) or text containing certain phrases. Such a tool can be very useful when you run a long-lasting investigation and want some kind of trigger pointing to the right place for analysis. Usually, you need a streaming technology for sending and processing the data to ensure near real-time feedback. Once again, Apache Spark can be of help, but other technologies like Apache Flink and/or Apache Kafka (with Kafka Streams) can also come in handy.
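A hedged sketch of such a monitor with Spark Structured Streaming reading from Kafka might look like the following – note that it needs the spark-sql-kafka connector on the classpath, and the broker address, topic name, and trigger phrases are all assumptions:

```python
# Sketch of a near real-time monitor: keep only messages with watched phrases.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("osint-monitor").getOrCreate()

WATCHED = ["weapon", "military base"]  # hypothetical trigger phrases

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
    .option("subscribe", "collected-pages")            # assumed topic
    .load()
    .selectExpr("CAST(value AS STRING) AS text")
)

# Build an OR of "text contains phrase" conditions.
condition = None
for phrase in WATCHED:
    clause = F.lower(F.col("text")).contains(phrase)
    condition = clause if condition is None else condition | clause

hits = stream.filter(condition)

# In a real flow you would push hits to a search engine or a notifier;
# the console sink is just for experimenting.
query = hits.writeStream.format("console").start()
query.awaitTermination()
```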
Online Querying
Building complex processing jobs is not always possible, and may not always be the best idea. Sometimes you just want to search data from current and historical investigations based on simple text and some metadata. The Open Source world provides two interesting and similar technologies in this area: Elasticsearch and Apache Solr. These are full-text search engines: they allow you to store (index) your data, type simple phrases or complex queries (similar to Google Dorks/X-Ray), and get results in milliseconds (sometimes seconds ;)). NiFi has built-in components for pushing data into both of these databases/engines. Spark doesn’t fall behind, as it also provides support for reading from and writing to ES and Solr. A good example is – once again – Metadata Digger, which retrieves metadata from images, recognizes objects in pictures with Spark, and saves the results to Solr.
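For a taste of how this works, here is a minimal sketch using a recent version of the official Elasticsearch Python client – the index name, document shape, and local URL are assumptions:

```python
# Minimal sketch: index an investigation note, then search it back.
# Requires: pip install elasticsearch
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Index a note from an investigation (hypothetical document shape).
es.index(index="osint-notes", document={
    "case": "case-2021-001",
    "text": "Profile reuses the same avatar as the forum account.",
    "collected_at": "2021-05-10T12:00:00Z",
})

# Full-text query over all indexed notes.
hits = es.search(index="osint-notes", query={"match": {"text": "avatar"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["text"])
```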
In addition to full-text search engines/databases, it’s worth mentioning other NoSQL solutions. In OSINT, you can have data whose records are ordered in time, e.g. logs from devices or user actions. These are called Time Series. Big Data offers quite interesting options for near real-time queries over this kind of data, e.g. Apache Druid. What is very interesting is that Druid is designed for integration with streaming technologies like Kafka, so you can attach it to the monitoring tool I mentioned in the previous section and query near real-time data – see the sketch below. There are also similar technologies, such as Apache Pinot, that store streaming data in a columnar way; Pinot also supports SQL as a query language.
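As a quick illustration, Druid exposes a SQL endpoint over HTTP, so querying it from Python can be as simple as the sketch below – the router URL and datasource name are assumptions:

```python
# Sketch: hourly counts of monitor hits via Druid's SQL-over-HTTP endpoint.
# Requires: pip install requests
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql"  # assumed Druid router

query = """
    SELECT TIME_FLOOR(__time, 'PT1H') AS hour, COUNT(*) AS events
    FROM "monitor-hits"
    GROUP BY 1
    ORDER BY 1 DESC
    LIMIT 24
"""

resp = requests.post(DRUID_SQL, json={"query": query}, timeout=30)
for row in resp.json():
    print(row["hour"], row["events"])
```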
What’s next?
In the next part of this article, we will go through other areas that can be used in the Processing and Analysis phase, such as Machine Learning and Visualizations. We will also check how all of these Big Data elements can be utilized in the other steps of the OSINT cycle: Dissemination, Feedback, and Planning and Direction. Stay tuned!