
Do you sometimes feel overwhelmed by manual work during your OSINT investigations? Not all of the work should be automated, but some of it definitely can be! You can write custom Python scripts and run them when needed, but what if you don’t know Python, or simply don’t have enough time to implement more complex tasks? Or perhaps you want to build a larger, scalable system for your team? This is where NiFi comes to the rescue!
Apache NiFi – very quick intro
NiagaraFiles
The National Security Agency started the NiagaraFiles project in 2006 and eight years later open-sourced it as NiFi. It is currently developed under the Apache Software Foundation. According to the official site: “Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic“. In most cases, it’s part of wider Big Data systems, but you don’t have to operate on huge volumes of data to use NiFi, as its value does not lie only in its scalable architecture.
Basic components of NiFi
From the OSINT point of view, NiFi is particularly useful when it comes to data gathering and preparation for analysis. NiFi provides a web-based user interface where you can build your data flow using two main elements: Processors and the Relationships that connect them. Processors can “produce” data or transform existing data. Let’s dive a little deeper into those components:
- Processors are building blocks that perform specific actions. We can divide them into three groups (this is not an official classification, though):
- Input Processors – have no incoming connections. They are responsible for starting the whole flow, e.g. you can use the InvokeHTTP processor to gather data from a website or API.
- Transforming Processors – have incoming connections and are intended for modifying the incoming data. There are plenty of built-in processors for this kind of work.
- Output Processors – provide writing capabilities and are intended for data ingestion. NiFi ships with many ready-to-use processors that you can just drag and drop, e.g. the PutElasticsearchHttp processor.
- Connections (relationships) – once you create at least two Processors, you can connect them using the relationships provided by the Processor where the connection starts. Many Processors offer more kinds of relationships, but the following two are the most common:
- Success – this type of connection carries “data packets” that have successfully passed through the Processor.
- Failure – carries the packets for which processing failed.
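The Processor/Relationship model can be sketched in plain Python. This is only a conceptual analogy, not NiFi’s actual API: each processor drains an input queue and routes every packet to either a “success” or a “failure” queue, just as a NiFi Processor routes FlowFiles to its relationships.

```python
from collections import deque

def transform_processor(incoming, success, failure):
    """Pop each packet from the incoming queue and route it to the
    'success' or 'failure' relationship depending on the outcome."""
    while incoming:
        packet = incoming.popleft()
        try:
            # a trivial transformation: uppercase the payload
            packet["content"] = packet["content"].upper()
            success.append(packet)
        except (AttributeError, TypeError):
            # non-string payloads cannot be transformed -> failure
            failure.append(packet)

incoming = deque([{"content": "hello"}, {"content": 42}])
success, failure = deque(), deque()
transform_processor(incoming, success, failure)
print(len(success), len(failure))  # 1 1
```

In real NiFi you never write this loop yourself; the framework moves FlowFiles between queues for you, and you only configure which relationship leads where.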
FlowFiles
Connections take the form of a queue, so data coming from one Processor to another can wait there until the next Processor is ready to process it. Those “data packets” moving through a NiFi flow are called FlowFiles. Each FlowFile consists of attributes (metadata) and content. Let’s assume we have a simple flow:
As you can see, there are three Processors:
- GetTweet – gathers tweets from the Twitter Streaming API and pushes each tweet as a single FlowFile to the “Success” connection. The raw tweet is placed in the FlowFile’s content, but there is also some metadata in the attributes.
- EvaluateJsonPath – parses the incoming tweets, retrieves the id field from each FlowFile’s JSON content and puts it into an attribute. We do this because PutElasticsearchHttp needs this ID as the Elasticsearch document’s unique identifier.
- PutElasticsearchHttp – writes the FlowFile’s JSON content into Elasticsearch.
Once you check the first connection, you can see that there are 176 FlowFiles queued.
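The EvaluateJsonPath step above can be mimicked in a few lines of Python (illustrative only; inside NiFi this is processor configuration, not code you write). The tweet JSON is the FlowFile content, and the extracted id is copied into the attributes. The attribute name tweet.id is an assumption for this sketch:

```python
import json

def evaluate_json_path(flowfile):
    """Copy the 'id' field from the JSON content into the FlowFile's
    attributes, similar to EvaluateJsonPath with a JsonPath of $.id."""
    doc = json.loads(flowfile["content"])
    flowfile["attributes"]["tweet.id"] = str(doc["id"])
    return flowfile

flowfile = {
    "attributes": {"mime.type": "application/json"},
    "content": '{"id": 1234567890, "text": "hello from NiFi"}',
}
flowfile = evaluate_json_path(flowfile)
print(flowfile["attributes"]["tweet.id"])  # 1234567890
```

Note that the content itself is untouched; only the metadata grows. Downstream processors such as PutElasticsearchHttp can then reference the attribute without re-parsing the JSON.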
OSINT use cases
Let’s go back to the beginning for a moment. What are the real-life situations where NiFi could be used in an OSINT investigation? Here is a short list based on my experience, but once you get the general idea of NiFi, you will definitely come up with examples useful for your own work.
Crawling Twitter
You can build a complex NiFi flow incorporating Python scripts (responsible for calling the Twitter API) where the gathered information is used to trigger another flow. Let’s assume you have a list of hashtags that should be monitored. A starting Processor downloads tweets and pushes them into the flow. At some point, you can split your flow, which clones the original FlowFiles. Consequently, one branch goes on to the PutElasticsearchHttp processor while the other goes to a second crawler, which gathers the followers of all the users who used hashtags from your list. The downloaded followers can be pushed to a database or routed to another flow.
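The splitting step described above can be sketched in Python. This is a conceptual model, not NiFi code: in NiFi you simply drag the same relationship to two connections and the framework clones the FlowFiles for you. The queue names are hypothetical:

```python
import copy

def split_flow(flowfiles):
    """Clone each FlowFile so one copy can be indexed in Elasticsearch
    and the other fed to a second crawler. NiFi does this automatically
    when one relationship feeds two connections."""
    to_elasticsearch, to_follower_crawler = [], []
    for ff in flowfiles:
        to_elasticsearch.append(ff)
        # deep copy so each branch can modify its FlowFile independently
        to_follower_crawler.append(copy.deepcopy(ff))
    return to_elasticsearch, to_follower_crawler

tweets = [{"attributes": {"hashtag": "#osint"}, "content": "raw tweet JSON"}]
index_q, crawl_q = split_flow(tweets)
print(len(index_q), len(crawl_q))  # 1 1
```

The important detail is that the branches are independent copies: transforming a FlowFile in one branch does not affect its clone in the other.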
Scraping websites
By using the InvokeHTTP and GetHTMLElement processors, you can download websites and then parse them. You can write the final information to a database or use it in another flow. Let’s assume several sites provide lists of interesting usernames. You can scrape them and then run (as a script in a Processor) one of the existing OSINT tools (like Sherlock) to check for the existence of those usernames on social media. Once you build such a flow, you can run it periodically or just whenever you need.
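The scraping idea can be illustrated with a small Python sketch. The HTML snippet, the CSS structure, and the site list are all hypothetical stand-ins; in the real flow, InvokeHTTP would download the page and GetHTMLElement would do the extraction:

```python
import re

# Hypothetical HTML snippet standing in for a scraped page;
# in the real flow InvokeHTTP / GetHTMLElement would supply it.
html = '<ul><li class="user">alice</li><li class="user">bob_99</li></ul>'

def extract_usernames(page):
    """Pull usernames out of the scraped HTML, mimicking what
    GetHTMLElement does with a CSS selector."""
    return re.findall(r'<li class="user">([\w.-]+)</li>', page)

def candidate_profiles(username):
    """Build the profile URLs a Sherlock-style check would probe;
    this site list is a small illustrative sample."""
    sites = ["https://twitter.com/{}", "https://github.com/{}"]
    return [s.format(username) for s in sites]

users = extract_usernames(html)
print(users)  # ['alice', 'bob_99']
```

A script processor in the flow would then request each candidate URL and route FlowFiles by HTTP status (200 means the profile likely exists), which is exactly the kind of check Sherlock automates across hundreds of sites.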
Monitoring pastebin.com
There are websites that crawl pastebin-like sites and provide a search option. You can also use Google dorks to find something, but you have to wait until the latest pastebin content is indexed by such a search service. Let’s suppose you know that some sensitive (and crucial for your investigation) content will probably be published on pastebin.com. You can build a NiFi flow with built-in components to scrape and filter the latest pastes and to be notified when content with a particular keyword appears. We will build such a sample flow in the next part of this article.
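The filtering core of such a monitor can be sketched in Python. The sample pastes and the keyword pattern are made up for illustration; in NiFi, InvokeHTTP would fetch the pastes and a content-routing processor would apply the regular expression:

```python
import re

# Sample pastes standing in for content fetched from pastebin.com.
pastes = [
    {"key": "aB3dE9", "content": "dump of leaked credentials for acme.example"},
    {"key": "Zx81Qp", "content": "minecraft server config"},
]

# Hypothetical watchlist: match either the target domain or a phrase.
pattern = re.compile(r"acme\.example|leaked credentials", re.IGNORECASE)

# Keep only the keys of pastes whose content matches the watchlist.
matches = [p["key"] for p in pastes if pattern.search(p["content"])]
print(matches)  # ['aB3dE9']
```

Only the matching paste keys would continue down the flow toward the notification step; everything else would be dropped or archived.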
I’ve presented three use cases, but NiFi is very flexible, so you can use it to solve many different problems. Now, let’s move on to the practical part!
NiFi installation
The NiFi installation process is quite simple. Keep in mind that we’re not focused on a production environment; we’ll run NiFi with the default configuration. There are two options for launching it:
Docker – in most cases, it’s the preferred way, especially when you want to just start without configuring the tool from scratch. Unfortunately, the current Docker image of NiFi has some issues with persisting the data flow between container runs. It is possible, of course, but it requires a workaround.
Manual installation – it requires the following steps. This guide is intended for Linux/macOS users, but the process is very similar on Windows. Please check the differences in the official documentation.
- Install Java 8 or 11. Also, please set the JAVA_HOME environment variable to point to the directory with the installed Java.
- Download NiFi 1.14.0 (nifi-1.14.0-bin.tar.gz or nifi-1.14.0-bin.zip) from https://nifi.apache.org/download.html
- Unpack it (tar xzvf nifi-1.14.0-bin.tar.gz or unzip nifi-1.14.0-bin.zip)
- Go to the unpacked directory and run: bin/nifi.sh start
- After the first run, NiFi will generate a new user. You have to check the login and password in the logs. The following command will help you do so: grep -i generated logs/nifi-app*
Please wait a moment before you open the Web UI in your browser, as NiFi needs some time to complete its startup. Usually, after 15-30 seconds, you can open the NiFi Web UI at https://localhost:8443/nifi and log in using the credentials found in the logs.
You should see an empty workspace similar to this one:
Pastebin monitoring tool in NiFi
In the next article, we’ll go through all the steps of creating a simple monitoring tool for the content published on Pastebin. It’ll help you understand:
- how data can be gathered from an external service,
- what we can do to retrieve something from an HTML page,
- how we can add a condition that checks the content with regular expressions,
- how to send a notification to a Slack channel.
Stay tuned and follow me on Twitter (@jca3s)!