
This is the second part of the article about Big Data in OSINT. You can read the first part here.
OSINT cycle
In the previous post, I mentioned the following phases of an OSINT investigation:
- Planning and direction
- Collection/Gathering
- Processing and Analysis
- Dissemination
- Feedback
We discussed the Processing and Analysis stage, covering Processing (Batch and Stream) and Online Querying. Now, let’s continue with the next topics: Advanced Analytics (ML) and Visualizations.
Processing and Analysis
Advanced Analytics (including ML)
We can analyze all the collected data manually, but it is a much better idea to use methods like Machine Learning (a part of the wider field of Artificial Intelligence). ML is a very complex topic, but in short: the main goal in this area is to build a model that can detect patterns in data. We can train a model on labeled data (e.g. articles marked by a human as fake news or not) or on unlabeled data.
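To make this more concrete, below is a minimal sketch of training a classifier on labeled data with scikit-learn; the tiny dataset and its labels are purely hypothetical.

```python
# A minimal sketch of supervised learning on labeled text, assuming
# scikit-learn is installed; the data below is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Articles marked by a human as fake news (1) or not (0).
articles = [
    "Miracle cure discovered, doctors hate it",
    "City council approves new budget for road repairs",
    "Secret world government controls the weather",
    "Local library extends weekend opening hours",
]
labels = [1, 0, 1, 0]

# TF-IDF turns text into numeric features; logistic regression
# then learns which feature patterns correlate with the labels.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(articles, labels)

# Score a new, unseen article.
print(model.predict(["Aliens endorse mayoral candidate"]))
```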
From the perspective of an OSINT investigator, the most interesting use case is applying an already trained model. It can be provided as a file together with an application that runs it, such as YOLO, which detects objects in images. YOLO is not strictly considered a Big Data tool, but it is very effective and can be integrated with technologies like NiFi to process many images in near real time. You can also use the tool we’ve mentioned, Metadata Digger, to detect images with particular objects across a large number of files.
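As a quick illustration, here is a minimal sketch of running a pre-trained YOLO model with the ultralytics package; the model file and image path are placeholders.

```python
# A minimal sketch of object detection with a pre-trained YOLO model,
# assuming the `ultralytics` package; file names are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# In a pipeline such as NiFi, this call would run once per incoming file.
results = model("street_scene.jpg")
for result in results:
    for box in result.boxes:
        # Map the numeric class id back to a human-readable label.
        print(model.names[int(box.cls)], float(box.conf))
```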
The Internet is full of ML-based solutions that could be helpful in OSINT investigations. Downloading a model and an application to your local machine is one way; another is to use external services with a Web UI or a dedicated API. The second option is particularly useful, as you can integrate it with a Big Data tool such as Spark or NiFi. The three leading cloud providers offer interesting services in this area. The table below presents their services in the most important areas.
Cloud-based solutions

| | Amazon (AWS) | Google (GCP) | Microsoft (Azure) |
| --- | --- | --- | --- |
| Computer Vision (OCR, image/video classification and object detection) | Rekognition | Vision AI | Computer Vision |
| NLP (processing text, extracting entities, etc.) | Comprehend, Textract | Natural Language AI | Cognitive Service for Language |
| Speech to Text conversion | Transcribe | Speech-to-Text | Speech Services |
| Text to Speech conversion | Polly | Text-to-Speech | Speech Services |
| Language translation | Translate | Translation AI | Translator |
Computer Vision is a set of methods/algorithms that automate tasks on images and videos. In OSINT, that could be the detection of images with people, buildings, military objects, etc. (a classification problem). It could also mean detecting DeepFake avatars and, of course, generating them. When you use an image search service, Computer Vision works under the hood.
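As an example of the cloud route, here is a minimal sketch of label detection with AWS Rekognition through boto3; it assumes AWS credentials are already configured, and the image path is a placeholder.

```python
# A minimal sketch of calling AWS Rekognition, assuming configured
# AWS credentials; the image file is a placeholder.
import boto3

client = boto3.client("rekognition")

with open("photo.jpg", "rb") as f:
    response = client.detect_labels(Image={"Bytes": f.read()}, MaxLabels=10)

# Each label comes with a confidence score (0-100).
for label in response["Labels"]:
    print(label["Name"], label["Confidence"])
```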
NLP (Natural Language Processing) is useful when you have tons of articles, PDFs, Word documents, etc., and you need to extract names and addresses. It can also be used to produce a short summary that gives a general overview of a text. It has use cases similar to Computer Vision, as you can try to detect a DeepFake article or generate one. Topic Modeling is another useful method, which automatically divides a set of documents into groups of topics. When you investigate disinformation using posts from social media, you can check which users/groups write about given topics.
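For the entity extraction part, here is a minimal sketch using the open-source spaCy library; it assumes the small English model has been downloaded, and the sample sentence is made up.

```python
# A minimal sketch of Named Entity Recognition with spaCy, assuming
# `python -m spacy download en_core_web_sm` has been run beforehand.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith met reporters at 10 Downing Street in London.")

# Print each detected entity with its type (PERSON, GPE, FAC, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```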
Speech to text is a useful technique when you have recordings and need to verify whether they contain certain keywords or topics. Converting speech to text and indexing it in one of the Full-Text Search engines I’ve mentioned before can provide a valuable source of information.
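As an illustration, here is a minimal sketch that transcribes a recording locally with the open-source Whisper model and indexes the result in Elasticsearch; the file name, index name, and local Elasticsearch URL are assumptions, and a cloud service from the table above could replace the transcription step.

```python
# A minimal sketch: speech-to-text with Whisper, then indexing the
# transcript in Elasticsearch; names and the URL are assumptions.
import whisper
from elasticsearch import Elasticsearch

# Transcribe the recording locally.
model = whisper.load_model("base")
result = model.transcribe("recording.mp3")

# Index the transcript so it becomes searchable.
es = Elasticsearch("http://localhost:9200")
es.index(index="transcripts",
         document={"file": "recording.mp3", "text": result["text"]})
es.indices.refresh(index="transcripts")  # make the document searchable now

# Later: search the indexed transcripts for a keyword.
hits = es.search(index="transcripts", query={"match": {"text": "meeting"}})
print(hits["hits"]["total"]["value"])
```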
Text to speech is probably not so useful from the OSINT point of view, but it is worth mentioning here. When we think about Text to Speech, we often imagine an unnatural computer voice. However, in the age of DeepFakes it is a very dangerous method: sometimes samples of the target person’s voice are enough to generate speech that is very difficult to detect as fake.
Language translation is such a popular topic that I will leave it without a comment 🙂
The methods and cloud services presented above are just examples. I haven’t described the many Open Source alternatives this time, as it’s a very wide area and we would need a separate article to cover the most important libraries.
The AI/ML topic is very wide and complex, but it can definitely speed up your investigations. If you have any questions related to AI use cases, leave a comment or just drop me a line: jan@datahunters.ai.
Visualizations
Good visualization of data can provide you, your team, and your boss with critical insights, and it can be achieved with all sorts of technologies. I won’t cover the various Python libraries for drawing charts and network graphs; I would rather focus on solutions that provide a UI for creating dashboards backed by a database.
Let’s start with Open Source technologies. Kibana is a popular tool that lets you visualize data stored in Elasticsearch. The first step after installing Kibana is to build a dashboard. Below, you can see some examples.



What if you don’t want to use Elasticsearch as the database behind your visualization tool? An interesting option is Apache Superset, which can connect to any database that supports SQL. You can see the full list of supported databases on the official site. Below, you can check what Superset looks like, based on examples.



Another interesting option is Grafana, which was created mostly for visualizing monitoring data such as metrics and logs. It’s similar to Kibana, but it supports many more data sources than just Elasticsearch. See the example dashboards below.



There are also commercial alternatives well known in the Business Intelligence area, such as Power BI or Tableau, but the Open Source technologies presented above are a good starting point and should be sufficient in many cases. There is one more visualization technology worth mentioning when it comes to OSINT investigations: Graphistry, which you can use to visualize big networks/graphs. To check how it can be used to visualize users’ skills, see my article: StackOverflow – technology map.
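To give a feel for it, here is a minimal sketch using the PyGraphistry client; it assumes a (free) Graphistry Hub account, and the credentials and edge list are hypothetical.

```python
# A minimal sketch of plotting a network with PyGraphistry; the account
# credentials and the edge data below are hypothetical.
import graphistry
import pandas as pd

graphistry.register(api=3, username="...", password="...")

# A hypothetical edge list: which account interacts with which.
edges = pd.DataFrame({
    "src": ["acct_a", "acct_a", "acct_b"],
    "dst": ["acct_b", "acct_c", "acct_c"],
})

# Opens an interactive, GPU-accelerated graph view in the browser.
graphistry.edges(edges, "src", "dst").plot()
```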

Dissemination and Feedback
These two phases can make use of all the visual Big Data components that present information: dashboards, interactive graphs, or just static charts.
Last but not least, and to be honest one of the most important aspects: history. Thanks to the Big Data ecosystem, you can store data and conclusions from previous investigations. It’s a great source of information, as you can find similar patterns and compare your old findings with new ones, but also track the quality and progress of your research.
A good example is an investigation focused on detecting disinformation on social media. If you keep and analyze data in the Big Data ecosystem, you can mark accounts identified as trolls or bots. In further investigations, you can check relations to those accounts while taking the whole history into consideration. This gives us at least two benefits: we take advantage of previous analysis (ours and our team’s), and we keep historical data that may since have been removed from the Internet.
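To illustrate, here is a minimal PySpark sketch that joins freshly collected interactions against accounts flagged in earlier investigations; the paths, column names, and schemas are all hypothetical.

```python
# A minimal sketch of reusing historical findings in PySpark; paths and
# column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("osint-history").getOrCreate()

# Accounts flagged in earlier investigations: columns (account, label).
known = spark.read.parquet("/data/history/labeled_accounts")

# Newly collected interactions: columns (account, interacts_with, post).
fresh = spark.read.parquet("/data/current/interactions")

# Surface new posts that interact with previously flagged accounts.
suspicious = (fresh
              .join(known, fresh.interacts_with == known.account)
              .select(fresh.account, fresh.post, known.label))
suspicious.show()
```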
Planning and direction
I’ve intentionally left the first step of the OSINT investigation until the end. Once you know the potential of Big Data and how historical data can help you connect the dots, it’s easy to imagine how it can help you plan new investigations. When you use your Big Data ecosystem and feed it with data and conclusions from your analyses, it’s like training a big brain.
Summary
The technologies and approaches I’ve described are just the tip of the iceberg, but I hope they have shown the main concept of how Big Data can be used in OSINT. If you have an idea of how Big Data could boost your investigations but you’re not sure it’s a good fit, leave a comment or contact me (jan@datahunters.ai)!