Use of NiFi for OSINT automation purposes (part II)

Apache NiFi

This is the second article that focuses on using NiFi to automate OSINT activities. This time, we get straight to the practical part. If you haven’t read the first part of this tutorial, you can find it here.

Pastebin monitoring tool in NiFi

Goal

Let’s create a basic but practical data flow. Our goal is to build a monitoring tool for newly added content on pastebin.com. It will send a notification to Slack if the monitored content contains at least one of the following pieces of information:

  1. Passwords or usernames – we will use a very simple rule here. If the text contains one of the following keywords: “password:”, “pass:”, “user:”, “username:”, a notification should be sent. I know, it’s quite a naive rule, but the first flow shouldn’t be very hard :).
  2. An email address from the secmail.pro domain. If the content contains such an address, it should be detected, extracted from the text and then used in the notification.

The above rules are very simple, but once you understand how to build a flow in NiFi, it’s possible to make them much more sophisticated. You can follow my detailed instructions to build the flow manually or just download a ready-made flow from GitHub and import it into NiFi (the GitHub README describes how to do so).

OK, let’s get started! We want to build the following flow:

Final NiFi flow – Pastebin.com monitoring tool.

Collecting a list of latest pastes

In Article 1, I mentioned that NiFi provides many built-in Processors. To use one, you have to drag and drop the Processor icon placed in the top-left corner. See the screenshot below:

NiFi menu with “Processor” icon selected.

After dropping the icon, a small window opens with a list of available Processors and their descriptions. We need something that will let us fetch the HTML of https://pastebin.com/archive, which contains a list of the latest Pastes. The InvokeHTTP Processor is the one you’re looking for. Once you drag and drop it, double-click it and apply the following settings:

  • Change the name to “Get pastebins list” (or whatever you want).
  • Select all the relationships except for “Response” to be Automatically Terminated (the list on the right of the Settings tab).
  • Go to Properties and set the “Remote URL” property to “https://pastebin.com/archive”.
  • Go to the Scheduling tab and change the value of the “Run Schedule” field. It defines how often this Processor runs, which in practice means: how long it will wait between checks of Pastebin.com for the latest changes. Set it to 120 sec.
  • Click “Apply” to confirm.

Note: once you terminate a relationship, the data routed to it won’t be handled further, so you’ll lose it.
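
If you prefer to see this step in code, here is a minimal Python sketch of what this Processor does on every scheduled run. It’s only an illustration (using the third-party requests library), not part of the NiFi flow:

```python
import requests

# Rough equivalent of the "Get pastebins list" InvokeHTTP Processor:
# fetch the archive page; only 2xx responses go to the "Response"
# relationship, everything else is terminated.
ARCHIVE_URL = "https://pastebin.com/archive"

response = requests.get(ARCHIVE_URL, timeout=10)
if response.status_code // 100 == 2:  # the "Response" relationship
    html = response.text
    print(f"Fetched {len(html)} characters of HTML")
```

In NiFi, the 120-second pause between runs comes from the Run Schedule setting rather than from code.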

Congrats! You’ve just configured your first Processor! Let’s move on to the second one.

Retrieving URL

Now, drag and drop the GetHTMLElement Processor and configure it to build an action responsible for retrieving URLs from the list of the latest Pastes. This part requires some HTML/CSS knowledge, especially inspecting pages and finding particular HTML elements with CSS Selectors. If you don’t know how to do this, you can just copy-paste the settings below, or read up on CSS Selectors and page inspection first.

Let’s configure our Processor. Settings tab:

  • Name: “Retrieve URLs”
  • Automatically Terminated Relationships: select all of them except for “success”.

Properties tab:

  • URL: “https://pastebin.com”
  • CSS Selector: “.maintable td:not(.td_smaller) a”. CSS Selectors are patterns that allow you to find particular elements on an HTML page. If you open the source code of https://pastebin.com/archive, you’ll see that all links to the latest Pastes are placed in an HTML table, but two columns contain an <a> element (the HTML representation of a link): Name / Title (our target, as it points to the direct location of the paste) and Syntax – and we don’t want to gather the Syntax links. Unfortunately, the table cell with our target link doesn’t have any unique ID or CSS class, so we cannot select it directly. Instead, I’ve built a selector which picks all the links from the table (marked with the maintable class) that are not in the Syntax column (which has the td_smaller class): “.maintable td:not(.td_smaller) a”. A small script demonstrating this selector follows the list.
  • Output Type: “Attribute”. We are not interested in the text of the link or the whole HTML element – we just need the “href” attribute, which contains the URL.
  • Attribute name: “href”
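
Here is a small Python sketch (assuming the third-party requests and beautifulsoup4 packages) that demonstrates the selector outside NiFi; GetHTMLElement does essentially the same job inside the flow:

```python
import requests
from bs4 import BeautifulSoup  # third-party: beautifulsoup4

html = requests.get("https://pastebin.com/archive", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# The same selector as in our Processor: all links in the main table,
# excluding the Syntax column (its cells carry the td_smaller class).
for link in soup.select(".maintable td:not(.td_smaller) a"):
    print(link.get("href"))  # e.g. /Abc123XY – stored as HTMLElement in NiFi
```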

Now we have two Processors configured and need to connect them, so click on the center of the first one (InvokeHTTP) and drag the line to the second one (GetHTMLElement). Once you release the button, a small window will be displayed asking for the type of relationship. Select “Response”. This means that only successful responses (2xx statuses) from https://pastebin.com/archive will be handled here. Now, let’s move on to the third block.

Crawling content

Add a new InvokeHTTP Processor to our workspace. It’ll be responsible for crawling particular Pastes. Open the configuration of this Processor and set up the following:

Settings tab:

  • Name: “Get pastebin content”.
  • Automatically Terminated Relationships: select all of them except for “Response”.

Properties tab:

  • Remote URL: “https://pastebin.com/raw${HTMLElement}”. This URL contains a special string: ${HTMLElement}. Here, we use a FlowFile attribute (do you remember that a FlowFile consists of attributes and content?) named HTMLElement. This attribute keeps the value assigned in the GetHTMLElement Processor – in our case, the URL of the Paste retrieved from the HTML (see the sketch below).
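
To make the substitution concrete, here is what NiFi’s Expression Language effectively does with this property, expressed as a tiny Python sketch (the attribute value is hypothetical):

```python
# Hypothetical FlowFile attribute set by the GetHTMLElement Processor:
flowfile_attributes = {"HTMLElement": "/Abc123XY"}

# NiFi evaluates "https://pastebin.com/raw${HTMLElement}" per FlowFile:
remote_url = "https://pastebin.com/raw" + flowfile_attributes["HTMLElement"]
print(remote_url)  # https://pastebin.com/raw/Abc123XY
```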

You should now connect GetHTMLElement with the InvokeHTTP Processor that has just been created – use the “success” relationship.

Up to the last Processor, we had a single flow without forks, but now we want to use the response from the “Get pastebin content” Processor to apply two rules. That’s why we have to create two Processors. Let’s start with the first rule: the username and password check.

Detecting usernames and passwords

Please add the RouteOnContent Processor and adjust the configuration. This Processor is a kind of router: you specify a regular expression, and the Processor checks whether the content of a FlowFile matches it.

Settings tab:

  • Name: “Check if the content has a user or pass”
  • Automatically Terminated Relationships: select “unmatched” (we won’t handle content that doesn’t match; the “Contains Password or Username” relationship, created in the next step, will be connected to the Slack notifier).

Properties tab:

  • Match Requirement: “content must contain match” (it could also be the stricter second option, but in that case our regex would also have to change).
  • Click the “+” button and add a new property with the name “Contains Password or Username” and the value “(^|\s)(password:|pass:|user:|username:)”. The Processor adds a new output relationship for each such property and routes a FlowFile to it if the content matches the regex used as the value. You can test the pattern outside NiFi with the sketch after this list.
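
If you want to test the pattern before wiring it into NiFi, a quick Python check like this works (assuming Python’s regex flavor is close enough to Java’s for this simple expression):

```python
import re

# The same pattern we set as the property value in RouteOnContent.
# re.MULTILINE makes ^ match at the start of every line in this demo,
# since pastes usually span many lines.
PATTERN = re.compile(r"(^|\s)(password:|pass:|user:|username:)", re.MULTILINE)

sample = "leaked data\nuser: admin\npass: hunter2"
if PATTERN.search(sample):
    print("Would be routed to 'Contains Password or Username'")
```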

Please join the “Get pastebin content” Processor with the newly created one using the “Response” relationship.

Now, we can either add the last Processor to this branch of the flow or fork the response. Let’s finish the branch we’ve started.

Slack notification

This step assumes that you have a Slack account with permission to create webhooks. Drag and drop the PutSlack Processor and configure it:

Settings tab:

  • Name: “Send Slack notification”
  • Automatically Terminated Relationships: select all of them.

Properties tab:

  • Webhook URL. I assume you know how to create a webhook in Slack. It’s very easy, but in case you don’t, follow the official instructions: https://slack.com/intl/en-pl/help/articles/115005265063-Incoming-webhooks-for-Slack
  • Webhook Text: “Pastebin content with a username or password has been detected. Check it here: https://pastebin.com${HTMLElement}”. As you can see, we are using the same FlowFile attribute again. (The sketch after this list shows roughly what PutSlack does under the hood.)
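
Under the hood, PutSlack roughly POSTs a JSON payload to the webhook. A Python sketch of the same call (the webhook URL is a placeholder; use the one Slack generated for you):

```python
import requests

# Placeholder URL – replace with your own incoming webhook.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

# Roughly what PutSlack sends: a JSON object with a "text" field,
# after the ${HTMLElement} attribute has been substituted.
payload = {
    "text": "Pastebin content with a username or password has been "
            "detected. Check it here: https://pastebin.com/Abc123XY"
}
requests.post(WEBHOOK_URL, json=payload, timeout=10)
```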

Connect your RouteOnContent Processor with PutSlack using the “Contains Password or Username” relationship. A FlowFile will be routed to PutSlack only if its content matches the regex.

At this point, the first branch of the flow is finished. You could test it, but before you start, let’s add the second branch with the “secmail.pro” rule.

Retrieving secmail.pro addresses

Go back to the “Get pastebin content” (InvokeHTTP) Processor and put the ExtractText Processor below it. It’ll retrieve the first secmail.pro email from the content of the FlowFile.

Settings tab:

  • Name: “Retrieve secmail.pro email”
  • Automatically Terminated Relationships: select “unmatched” (here, “unmatched” means the content doesn’t contain an email with the secmail.pro domain).

Properties tab:

  • Enable Multiline Mode: “true”
  • Click the “+” button and add a new property (it will become a FlowFile attribute) with the name “secmail_address” and the value: “[\S]*@secmail.pro(?=(\s|,|\.|!|\?)|$)”. It’s not a very precise validation regex, but it’s sufficient for our purposes – you can try it out with the sketch after this list.
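
You can try the pattern on a sample paste with a few lines of Python before relying on it (the sample text is made up):

```python
import re

# The same pattern as in the ExtractText property. As noted above, it is
# a loose match – e.g. the dot in "secmail.pro" is unescaped – but good
# enough here. The whole match of the first hit is what NiFi exposes as
# the secmail_address.0 attribute.
PATTERN = re.compile(r"[\S]*@secmail.pro(?=(\s|,|\.|!|\?)|$)", re.MULTILINE)

sample = "dumped accounts:\ncontact throwaway@secmail.pro, more below"
match = PATTERN.search(sample)
if match:
    print(match.group(0))  # throwaway@secmail.pro
```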

The ExtractText Processor will try to extract a string matching the regex from the content. If it succeeds, it’ll route the FlowFile to the “matched” relationship; otherwise, to “unmatched”.

Please connect the “Get pastebin content” and “Retrieve secmail.pro email” Processors using the “Response” relationship.

Cloning Slack notifier

Now, we will add the last Processor. To make your work easier, select the “Send Slack notification” Processor and copy & paste it (just CTRL+C, CTRL+V on Linux and Windows). Move it under the ExtractText Processor and change only one property in the Properties tab (the rest is fine as it is a clone of the existing PutSlack Processor) – Webhook Text: “Pastebin content with ${secmail_address.0} email has been detected. Check it here: https://pastebin.com${HTMLElement}”. Here, you can see that we’ve used two FlowFile attributes:

  • secmail_address.0 – the result of the regex match (the whole matched string)
  • HTMLElement – the URL of the Paste

Our last action to finish the whole flow is creating a connection between the ExtractText and the cloned PutSlack Processors using the “matched” relationship.

Testing the Pastebin Monitoring tool

We can now start the whole flow or just a part of it. There is one Processor that is very useful in testing: GenerateFlowFile. Add it and set the “Custom Text” property to some text containing the phrase “password:”. Then, create a new attribute (“+” button) with the name HTMLElement and the value /test123. After that, connect this Processor to “Check if the content has a user or pass”. Right-click on GenerateFlowFile and select “Run once”. Next, right-click on the workspace and select “Refresh”. You should see one new FlowFile in the queue (connection) between the GenerateFlowFile and RouteOnContent Processors (see the screenshot below).

One FlowFile has been queued in the connection between the GenerateFlowFile Processor and “Check if the content has a user or pass”.

Now, run RouteOnContent, and you’ll see that the Processor has detected the phrase “password:” in the generated FlowFile and moved it to the “Contains Password or Username” relationship. To check what is inside the queue, right-click on the connection and select “List queue”. The new window shows up to 100 FlowFiles. By clicking the “i” icon on the left of each record, you can inspect the attributes and content of a particular FlowFile. You should see the HTMLElement attribute with the value /test123 (the same one you created in GenerateFlowFile). This view is very useful when you want to analyze what is going on with your FlowFiles; you can find detailed information there about every change applied by particular Processors.

Run your last Processor – PutSlack – and check if you have something similar to this on your Slack channel:

Sample notification on Slack channel.

To clean up after our tests, stop the RouteOnContent Processor, then click on the connection coming from GenerateFlowFile and remove it.

Running the Pastebin Monitoring tool

To run the whole flow, select all the Processors (use the Shift key and the mouse) and click the “play” button on the left, under the navigation map. Before you do so, please check that the first Processor has the right value for “Run Schedule” (Scheduling tab). The default is 0 sec, which means that there will be no break between runs of this Processor. You should avoid this when you call publicly available sites, as it’ll trigger a blocking mechanism sooner rather than later.

Once you start your flow, you can observe how FlowFiles move through the connections. You can stop each Processor if you want to check the content or attributes of particular FlowFiles.

Congrats! You’ve just finished your first NiFi flow!

Bonus – Homework!

I strongly encourage you to play a little with the flow that you’ve just built. You can add other conditional Processors or change the existing ones, but there is one thing that probably comes to mind when you look at our flow – what happens to all those Pastes gathered from Pastebin.com? Well, they are removed. It’s not optimal to just discard potentially useful information, so your homework is to add connections and new Processors which will save our Pastes to some database (or just to raw files). It could be Elasticsearch, as that will make searching much easier in the future. In the case of ES, you will have to create a JSON object first, and then put it into ES – a possible document shape is sketched below. Remember that it matters where you add the connection that moves the data to ES: you can save all Pastes or just those that matched our rules. Good luck!
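
If you go the Elasticsearch route, a document for a single Paste could be as simple as the sketch below. The field names are hypothetical, and on the NiFi side you could assemble such a JSON with Processors like AttributesToJSON or ReplaceText before an Elasticsearch put Processor:

```python
import json
from datetime import datetime, timezone

# Hypothetical shape of an Elasticsearch document for one Paste.
doc = {
    "url": "https://pastebin.com/Abc123XY",           # from ${HTMLElement}
    "content": "user: admin\npass: hunter2",          # FlowFile content
    "collected_at": datetime.now(timezone.utc).isoformat(),
    "matched_rule": "Contains Password or Username",  # optional metadata
}
print(json.dumps(doc, indent=2))
```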

NiFi aspects not yet covered

In this article, I’ve shown the basic functionality of NiFi. As you can see, building flows is quite easy. You can take advantage of the built-in Processors and connect them in order to build very complex data flows. Here is a list of other important aspects you should be familiar with if you want to use NiFi for more complex projects:

  • You can group your flows into logical modules called Process Groups. This will help you keep your flow clean and readable.
  • Some Processors in our flow referenced FlowFile attributes in their configuration. You can do much more with this, as NiFi provides an Expression Language. You should definitely check it out: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
  • NiFi allows you to process data in parallel. Our flow uses a single thread on each Processor, but you can increase it. Another thing you can do is launch NiFi in a cluster so that many servers run each Processor. In such a case, you can process data not only in many threads (on a single machine) but also on many servers.
  • NiFi has quite a long list of ready-to-use Processors, but there are cases where you want to add an action not implemented in the existing modules. You can create your own Processor using Java (or another JVM language like Scala or Kotlin). Before you do this, as usual – check if someone hasn’t implemented a similar or the same functionality before you ;).
  • Implementing a custom Processor is not always necessary. There are Processors supporting scripts (e.g. ExecuteScript). They are a very good way of integrating existing OSINT tools with NiFi.
  • In many cases, the order of FlowFiles is not very important, but there are cases where it’s crucial. You cannot assume that NiFi will always keep the order; it depends on your flow and the configuration of NiFi and its Processors.
  • Your flows can be exported in the form of a Template (as I did with our Pastebin Monitoring tool; please check GitHub: https://github.com/data-hunters/nifi-pastebin-monitoring-flow).
  • If you are familiar with Git (e.g. GitHub) or another Version Control System, you should definitely check out NiFi Registry. It provides versioning for your flows. To be honest, for me NiFi Registry is a mandatory part of production projects.

There are also other aspects that you may find interesting, as NiFi is a very powerful system. However, you should remember that NiFi is not a solution to every processing/automation problem.

I hope this example of how NiFi can be used for OSINT automation will be useful in your daily work. If you have any questions or ideas for NiFi usage, or if you want me to write about more advanced NiFi functionality, leave a comment here or on Twitter (@jca3s)!
