Today we want to give a basic introduction to Apache Solr and then (in the following post) use it as output storage for Metadata Digger. Why Solr? It lets you search through metadata using all the great features of a Full Text Search engine. Before we start, let’s go through some basic Solr concepts. We will explain the overall architecture and give some use cases to convince you to at least try Solr in your OSINT research.
First, we should mention that Solr is built on Apache Lucene, a Full Text Search library. In most cases, as a user, you won’t touch this part of Solr directly, but it’s good to know about it because Lucene is the heart of the whole system. Solr adds some important layers on top of it, turning it into a great Open Source product.
Solr for Big Data
Solr has been widely adopted as a backend search platform for many commercial services. It is also included in some Hadoop distributions, but why is it a good search platform for Big Data? It has built-in mechanisms that make Solr scalable and fault-tolerant:
- Replication – you can specify that a collection (the equivalent of a table in relational databases) should be stored in two or more copies. When one replica is down, you can use another one.
- Sharding – when it comes to real Big Data (not just the buzzword), it is not possible (or at some point not reasonable, due to hardware costs) to keep all data on a single machine. You can split your collection into shards that are stored on different machines. One shard utilizes one CPU core.
Scalability and fault tolerance are handled by more mechanisms than those two, but for now it is not necessary to go deeper into them.
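To make replication and sharding more concrete, here is a minimal sketch of creating a collection through the Collections API. It assumes a SolrCloud node running locally on the default port 8983 (more on SolrCloud below) and the built-in "_default" configset; the collection name "tweets" is just an example.

```python
# Hedged sketch: create a collection with 2 shards and 2 replicas per shard.
# Assumes SolrCloud on localhost:8983 and the built-in "_default" configset.
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "tweets",              # example collection name
        "numShards": 2,                # split the collection across 2 shards
        "replicationFactor": 2,        # keep 2 copies of every shard
        "collection.configName": "_default",
    },
)
print(resp.json())                     # Solr reports success or failure as JSON
```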
Solr can be run in two modes:
- Standalone – without sharding and the other scalability features.
- SolrCloud – supports running Solr on many nodes using the sharding and replication features. This mode requires one more system: Apache ZooKeeper.
We will focus on the SolrCloud option. What is ZooKeeper and why is it important?
Suppose you want to build a system that runs on many servers which need to communicate with each other. You need some kind of coordinator that keeps information about:
- live nodes with their addresses (hostnames or IPs)
- configuration of particular datasets (collections in Solr’s case)
This kind of functionality (called clustering) could be implemented as an additional layer in the target system or delegated to a separate technology. The creators of Solr decided to use an external system – Apache ZooKeeper. Solr’s biggest competitor, ElasticSearch, takes another approach – built-in clustering.
ZooKeeper is used in many other technologies and it is battle-tested.
Solr for OSINT
Maybe you have valuable data that varies in format and volume. It could be the result of crawling Twitter, Facebook, LinkedIn or other websites, forums, etc., and you face problems like: how do I find information based on specific criteria? Sometimes grep is not sufficient 😉 I know – OSINT very often means searching on external websites/services like Google, DuckDuckGo, etc. However, there are situations when it is better to crawl/download a bigger part of the data and then analyze it offline. That way you do not leave information about your areas of interest in those services through your filters and keywords.
Solr can help you build your own search system. It’s not the easiest tool to start with, but I’m sure you can handle it for your purposes, especially as there are many scripts automating Solr setup (we also have our own Docker image for starting Solr from scratch). So, which features could be the most interesting from an OSINT researcher’s point of view? Here is my list:
- We should begin with the most obvious – Full Text Search. You can index your Twitter, Facebook and LinkedIn data, logs, even binary documents like PDFs, and then just search by content!
- Advanced query language with filtering by dates, numbers and geolocation (ranges, less than, greater than, equals, etc.).
- Synonyms – you can build your own list of synonyms and push it to Solr to improve your search results. When you are working on a topic with its own specific jargon, such a list can speed up your research work in most cases.
- Fuzzy Search. It allows you to specify the “tolerance” of the search algorithm when looking for similar phrases. Let’s suppose you don’t remember the exact name of some person and want to find everything about them in your “database”. The name sounds like “Kowalsky”, but if you are not sure, you can use the phrase “Kowalsky~1” and it will also find documents with the word “Kowalski” (one letter of difference). The algorithm is based on the Damerau-Levenshtein distance (see the query sketch after this list).
- Spellchecker. Similar (in its results) to Fuzzy Search, but you have to configure the comparison algorithm at the collection configuration level.
- MoreLikeThis. It’s a kind of recommendation system: it can return a number of documents similar to those in the main result list. It can be a nice feature boosting your work.
- Facets – simply put, faceting provides a list of unique values for a particular field (tweet language, for instance) together with a count of how many matching documents contain each value. Suppose you have indexed tweets and want to know how many posts with the phrase “Kowalski” were written in Polish, English and Russian. Faceting is the way to go in this case: just set the tweet language as the facet field and add your filtering criteria.
- Graph analysis. Some time ago Solr introduced modules for graphs. The first one is the Graph Query Parser, which searches documents while considering the relations between them. It has a major limitation – the data must fit into a single shard, so you cannot take advantage of distributed computing here. The second option (very similar from the user’s perspective), Graph Traversal, has been implemented on top of the powerful Solr Streaming API. With it, you can query data distributed across a multi-node cluster. You can index data containing relations between entities (like friends, followers, hashtags) and then run a query to select connected documents. In the current version Solr doesn’t provide very advanced graph algorithms, but it’s still worth checking out, especially since you can combine it with all the Full Text Search features. See the documentation for more info. We are going to write a post presenting a real-life use case for SNA (Social Network Analysis) with Spark and Solr. If you are interested in any particular examples, just leave a comment or write a message (dev@datahunters.ai) and we will try to focus on it.
- Advanced mechanisms for preparing data before indexing and during search, called Analyzers, Tokenizers and Filters. You can use the existing ones or build your own in Java and then use them in your Solr configuration.
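As an illustration of a couple of the features above, here is a hedged sketch of a fuzzy query combined with faceting through the standard /select handler. The collection name “tweets” and the fields “text” and “lang” are assumptions made for the example, not part of any default schema.

```python
# Hedged sketch: fuzzy search for "Kowalsky" (tolerating one edit), faceted by language.
# Assumes a collection "tweets" with hypothetical fields "text" and "lang".
import requests

resp = requests.get(
    "http://localhost:8983/solr/tweets/select",
    params={
        "q": "text:Kowalsky~1",    # also matches "Kowalski" (one letter of difference)
        "rows": 10,                # return the first 10 matching documents
        "facet": "true",
        "facet.field": "lang",     # count how many matches fall into each language
    },
)
data = resp.json()
print(data["response"]["numFound"])                  # total number of hits
print(data["facet_counts"]["facet_fields"]["lang"])  # e.g. ["pl", 12, "en", 7, ...]
```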
Basic definitions
Ok, we know some basics about Solr’s architecture and practical use cases. Now it’s time to learn what a document, collection, indexing and committing are:
- Document – a single entity like a tweet, user, post, etc. A document contains fields with values. Field names and types (text, numerical, geolocation, boolean, etc.) need to be defined in the Schema, a separate XML file in a required format. Normally fields must have their full names defined in the schema. However, you can also create dynamic fields, for which only a prefix is specified, which is helpful if you don’t know all the field names at configuration time. Let’s suppose you want to index Exif tags as separate fields. You don’t have the final list (I know, there is a kind of specification, but in real life you can find anything under “Exif”), so you define a field with the prefix “exif_” and set the “Text” type for all of them.
- Collection – a set of documents. What should be included in a single collection? Sometimes it is fine to have a bag of documents representing different types of data (tweets in the same collection as posts from some forum, Facebook posts, etc.), but in some cases that makes it hard to search, mostly because of the many dynamic fields (one document will have 10 fields, another one 50, depending on the source, and it can end in a total mess). In that case you can create a dedicated collection with a more specific Schema. To create a collection you need at least two files: solrconfig.xml and managed-schema (or the legacy schema.xml). You can take the default solrconfig.xml and make small changes to the default managed-schema without going into advanced options.
- Indexing – the process of pushing data into collections. One important thing here: it doesn’t mean that your data is ready for searching (see commit below).
- Committing – the operation that actually applies changes to a Solr collection. You have indexed data, but to make it available for search you have to make a commit. It’s a heavy operation, so it’s not reasonable to do it very often (especially with huge data sets). There are a couple of strategies regarding commits: you can force a commit after indexing the whole dataset or configure Solr to auto-commit every X milliseconds. Of course you can have more complex approaches here.
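To tie indexing and committing together, here is a minimal sketch of pushing two JSON documents (one with a dynamic “exif_*” field) and committing in the same request. The collection name “photos” and the field names are example assumptions; the collection and its dynamic field definition must already exist.

```python
# Hedged sketch: index two documents and make them searchable with an immediate commit.
# Assumes an existing collection "photos" whose schema allows dynamic "exif_*" fields.
import requests

docs = [
    {"id": "1", "text": "Example post mentioning Kowalski"},
    {"id": "2", "text": "Example photo post", "exif_Model": "ExampleCam 100"},  # dynamic field
]

resp = requests.post(
    "http://localhost:8983/solr/photos/update",
    params={"commit": "true"},   # heavy operation; for big datasets prefer autoCommit or commitWithin
    json=docs,
)
print(resp.json())
```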
Solr Admin UI
Solr has many powerful features, but you must be aware that it doesn’t provide a user-friendly graphical interface with visualizations. If you need one, you can:
- Integrate it with Apache Zeppelin (we will show some examples in the future).
- Use Apache Hue which provides a module for building interactive dashboards based on Solr results from predefined blocks (filters, charts, maps, tables, etc.).
- Use Banana (a port of Kibana, the visualization app for ElasticSearch, Solr’s competitor), which provides configurable dashboards with filters, charts and tables.
- Use any other tool that can visualize data from a SQL (JDBC) source – yes, Solr also supports querying with SQL (see the sketch after this list) 🙂
- You can also try configuring the Velocity Response Writer (a built-in Solr option) to prepare some interesting views.
- Build your own app 🙂 Solr provides a REST API and integrates with many programming languages.
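Regarding the SQL option mentioned above, here is a hedged sketch of querying Solr’s /sql handler (available in SolrCloud mode). The collection “tweets” and the field “lang” are assumptions for illustration only.

```python
# Hedged sketch: run a simple SQL statement against Solr's /sql handler (SolrCloud only).
# Assumes a collection "tweets" with a hypothetical "lang" field.
import requests

resp = requests.post(
    "http://localhost:8983/solr/tweets/sql",
    data={"stmt": "SELECT id, lang FROM tweets WHERE lang = 'pl' LIMIT 10"},
)
print(resp.json())   # results come back as a JSON "result-set"
```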
We don’t want to go into the pros and cons of the above tools right now. Let’s go back to the components shipped with Solr. Basically, you can access Solr directly via the API or using the Solr Admin UI. The second way is a better option when you are just beginning with the tool. The list of things you can do with the administration app is quite long, but these are the most interesting at the start:
- Creating and deleting collections
- Displaying configuration and schema of collection
- Adding (indexing) and deleting documents via a form
- Searching with all advanced options (results will be presented in XML or JSON format)
- Displaying all shards and replicas (with information about failures) on a graph
Next steps
You’ve just had a quick introduction to Solr, but it’s only theory. We like practice, so in the following post we will show how to run SolrCloud (using our Docker image), write (index) Metadata Digger output to Solr and search using the Solr Admin UI. In the meantime, you can read about starting Solr and try to index sample tweets or PDFs on your own. There are plenty of resources in this area on the Internet, so we decided not to add another similar tutorial 😉 However, if you have any problems with it, just leave a comment and we will try to help you 🙂
See you soon!