Matt McClure
AWS Cloud Search vs ElasticSearch: Which to Use?
Search seems like a pretty mundane feature across the internet that no one really thinks much about. Let’s face it, it’s not really that sexy of a topic compared to cutting-edge things like machine learning or the latest DevOps trends. But we propose you take a moment to rethink the humble search, because we really take for granted how great search is these days.
Think about shopping on Amazon. You type in the title of a book you’ve been wanting in the search bar, hit enter, and you’ll get a page of results within literally two, maybe three seconds. By one estimate, Amazon sells over 75 million products. And one search found what you were looking for in seconds. That is insane; how do they search such a huge database of products so quickly? And it’s not just Amazon. Google, YouTube, Spotify, any given news site, the user experience of them all is mostly based around you searching for some string of text and immediately getting back a thorough and accurate list of relevant results.
Imagine if any of these sites consistently took 10 or more seconds to return search results. I bet you wouldn’t stick around long before finding a faster site. Or imagine you had to wade through page after page of irrelevant results to find what you actually wanted. No store or streaming service would stay in business if you couldn’t even find what you came there for. Search is a paramount function of the web and really a form of magic: instant and relevant results out of millions or billions of rows of data.
Working this magic for your own site, though, doesn’t have to be anywhere near as complicated as you might think. As with many things, AWS offers not one but two services for building cost-effective, high-throughput, low-latency search solutions: CloudSearch and ElasticSearch. If they sound pretty similar, they are. But there are subtle differences that might affect which service you choose.
What is CloudSearch?
CloudSearch uses Apache Solr under the hood to power your searches. More accurately, it’s an app based on Solr, modified to be more manageable via the AWS console or API.
In their words, “Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world’s largest internet sites.”
Such power though does not come easy. Powering your site’s search with Solr still requires installation (either on a dedicated server, in Docker, or on a Kubernetes cluster) and configuring additional nodes if you want fault tolerance. Make sure you plan for ongoing maintenance of those servers and upgrades of Solr when necessary.
Running applications like this is something most developers don’t want to spend time on and would rather trade this tedious time sink for a better solution. That solution: CloudSearch.
In AWS’s words, CloudSearch “makes it simple and cost-effective to set up, manage, and scale a search solution for your website or application.” It’s a managed service, like EC2 or S3, where AWS is running the application for you (Solr in this case). The application itself is abstracted away behind the AWS console and APIs. The installation, upgrading, and administration of the software are all handled behind the scenes. This allows you to get to work much quicker and with significantly less administrative overhead; with managed services, there’s no setting up servers or installing software.
To set up CloudSearch, first you create a search domain, which “encapsulates a collection of data you want to search, the search instances that process your search requests, and a configuration that controls how your data is indexed and searched.” It’s basically all your data and the options you set up around searching it.
For scalability, as mentioned, CloudSearch supports multiple nodes. Setting up these nodes is the first choice you will make when configuring CloudSearch, namely choosing an instance type and number of nodes. The instance types go from small to 2xlarge, giving you incrementally more data capacity and greater speed to index the data.
Next you need a file to define your index fields. Imagine the columns of a spreadsheet; the index field definition must list every field that your actual data will contain, so that CloudSearch knows what to expect. You can supply it as a file that you upload or that sits in an S3 bucket, in a format like JSON, XML, or CSV, or as a DynamoDB table in your AWS account. Whichever source you use, the next step is to confirm the field types.
Next comes the actual data upload, either in the same file formats or from DynamoDB. One interesting use case is sourcing your file from an S3 bucket. This opens up all kinds of possibilities for automation involving other parts of your app that are dumping data in S3 already.
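To make the spreadsheet analogy concrete, here is a minimal Python sketch of checking a document against a set of index fields before uploading it. The field names (title, genres, year, rating) and the check_document helper are hypothetical; the type names text, literal, int, double, and the -array variants are CloudSearch field types, but the validation rules here are a simplification, not what the service itself runs.

```python
# Hypothetical index field definitions: field name -> CloudSearch field type.
INDEX_FIELDS = {
    "title": "text",
    "genres": "literal-array",
    "year": "int",
    "rating": "double",
}

# Python types this sketch accepts for each base field type (an assumption).
PYTHON_TYPES = {
    "text": str,
    "literal": str,
    "int": int,
    "double": (int, float),
}


def check_document(fields, index_fields=INDEX_FIELDS):
    """Return a list of problems found in one document's fields."""
    problems = []
    for name, value in fields.items():
        field_type = index_fields.get(name)
        if field_type is None:
            problems.append(f"{name}: not defined in the index")
            continue
        # Array types hold a list of values of the base type.
        is_array = field_type.endswith("-array")
        base = field_type[: -len("-array")] if is_array else field_type
        if is_array and not isinstance(value, list):
            problems.append(f"{name}: expected a list for {field_type}")
            continue
        values = value if is_array else [value]
        expected = PYTHON_TYPES[base]
        for item in values:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(item, bool) or not isinstance(item, expected):
                problems.append(f"{name}: {item!r} is not a valid {base}")
    return problems
```

Running a check like this locally is optional, but it catches a missing or mistyped field before the upload step rather than after.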
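If you upload JSON, CloudSearch expects a batch: a list of add and delete operations, each with a document id. The sketch below builds such a batch from plain Python records. The build_batch helper, the sample records, and the assumption that each record carries its id under an "id" key are all illustrative.

```python
import json


def build_batch(records, deleted_ids=(), id_key="id"):
    """Build a CloudSearch-style JSON document batch as a string."""
    batch = []
    for record in records:
        # Everything except the id becomes a searchable field.
        fields = {k: v for k, v in record.items() if k != id_key}
        batch.append({"type": "add", "id": str(record[id_key]), "fields": fields})
    for doc_id in deleted_ids:
        # Delete operations carry only the document id.
        batch.append({"type": "delete", "id": str(doc_id)})
    return json.dumps(batch)


# Hypothetical sample data.
records = [
    {"id": "m1", "title": "Star Wars", "year": 1977},
    {"id": "m2", "title": "Alien", "year": 1979},
]
batch_json = build_batch(records, deleted_ids=["m0"])
```

The resulting string is what you would save as a file for the console upload, or send to the domain’s document endpoint.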
Imagine dumping all your microservices’ logs to a bucket; multiple log files all filling up with valuable real time data and metrics. Valuable, that is, only if you have some way to search them. Point CloudSearch at that bucket and you have a new valuable tool for getting actionable data from those logs. Your search domain is updated immediately as the data source updates, for accurate real time search results.
Once CloudSearch ingests all the data, no matter the source, you can perform your first search, either via the AWS console or via an HTTP endpoint (with your search parameters in the URL as query strings). You can also search programmatically via the API, so there is very little limit to what you can do with your searches. For more specific guidance on setting up a search domain using sample data, see the official CloudSearch getting started docs.
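As a rough illustration of the query-string approach, this Python sketch assembles a search URL. The endpoint hostname is a made-up placeholder (your real one is shown in the console for your domain), and build_search_url is a hypothetical helper; q, size, start, and return are query parameters of the CloudSearch 2013-01-01 search API.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; substitute the one listed for your domain.
SEARCH_ENDPOINT = "https://search-movies-EXAMPLE.us-east-1.cloudsearch.amazonaws.com"


def build_search_url(query, size=10, start=0, return_fields=None,
                     endpoint=SEARCH_ENDPOINT):
    """Return a search URL with the parameters encoded as a query string."""
    params = {"q": query, "size": size, "start": start}
    if return_fields:
        # Limit the response to the listed fields.
        params["return"] = ",".join(return_fields)
    return f"{endpoint}/2013-01-01/search?{urlencode(params)}"


url = build_search_url("star wars", size=5, return_fields=["title", "year"])
```

A GET request to a URL like this returns the matching documents as JSON, assuming the domain’s access policy allows your client to search.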
What is ElasticSearch?
So that’s CloudSearch, but as you’ll recall the point of this article is not only to elaborate on that service but to highlight the differences between it and ElasticSearch, the other managed search service offered by AWS.
What’s ElasticSearch? Like CloudSearch, it is a managed app run by AWS, with all the server and app admin work done for you, which is always a plus for busy devs. ElasticSearch in this case is the name of the app, one part of a trio of open source projects (ElasticSearch, Logstash, and Kibana) known together as the ELK stack.
Getting started in ElasticSearch on AWS is very similar to CloudSearch, with one exception: you don’t need to submit data to define index fields first, because ElasticSearch generates the index automatically from the real data. Otherwise it’s identical: create a search domain, upload data, and search from the console or endpoint. Again, the AWS docs offer a valuable starting point.
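To show what uploading and searching look like without a field definition step, here is a small Python sketch that builds the two request bodies involved: a newline-delimited body for the _bulk API and a match query for the _search API. The index name, documents, and helper names are hypothetical, and older ElasticSearch versions also expect a _type in the action line, which this sketch leaves out.

```python
import json


def build_bulk_body(index, docs):
    """Build an NDJSON body for the _bulk API from a {doc_id: document} dict."""
    lines = []
    for doc_id, doc in docs.items():
        # Each document is preceded by an action line naming index and id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def build_match_query(field, text, size=10):
    """Build a simple full-text match query body for the _search API."""
    return {"size": size, "query": {"match": {field: text}}}


# Hypothetical sample data; no mapping is declared anywhere.
docs = {"1": {"title": "Star Wars", "year": 1977},
        "2": {"title": "Alien", "year": 1979}}
bulk_body = build_bulk_body("movies", docs)
query_body = build_match_query("title", "star")
```

Note that nothing here declares title as text or year as a number; ElasticSearch infers a mapping from the first documents it sees.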
The scalability is all there with ElasticSearch as well, allowing for multiple nodes in a cluster to handle processing and querying your data.
One additional trick, though, is using Kibana, another part of the ELK stack. Kibana can be used to create powerful dashboards and beautiful analytics for your data, as seen on their site. It runs as a part of your managed ElasticSearch deployment; you only need to check the AWS console to find the URL endpoint to access your instance.
CloudSearch vs. ElasticSearch
So we’ve looked at both managed “Search-as-a-Service” tools. Now it’s time for the real question: which one do you use? Well, like all good tech stack architecture discussions, it depends. The two search solutions have different strengths, different intended use cases, different ecosystems, and different costs.
The biggest difference: CloudSearch is fully managed by AWS, so there’s not a lot of tinkering under the hood that you can do. This might be a plus for you depending on your use case; you might just want something simple and powerful that can be set up in a few clicks, more of a set-it-and-forget-it kind of solution.
ElasticSearch, however, as a full-on open source project, has a huge ecosystem and community. The levels of customization and extensibility are substantial. Again, if this is what you’re after over the more turnkey CloudSearch, then you’re good to go.
Your particular use case is also a big factor in which managed solution makes the most sense. CloudSearch is more of a focused solution for search; Amazon even uses it itself to power search for amazon.com. ElasticSearch is more of an extensible framework built around the search core, applicable for analysis, dashboards, visualizations, and plugging into other solutions in the ecosystem for content management or business intelligence.
With any project that handles data, security is a huge concern. Luckily, both solutions come with security baked in. Both have access control built in. CloudSearch authentication and authorization are backed by IAM, allowing for a security model in line with all other AWS services. ElasticSearch has its own security practices using a plugin called Shield, allowing for integration with many IdPs along with encryption, auditing, IP filtering, and other security controls.
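For a sense of what IAM-backed access control looks like on the CloudSearch side, the Python sketch below builds an access policy document that allows search and suggest requests only from given IP ranges. The build_search_policy helper and the CIDR block are illustrative assumptions; the policy follows the general IAM policy shape, so check the CloudSearch access policy docs before relying on it.

```python
import json


def build_search_policy(allowed_cidrs):
    """Build an IAM-style policy allowing search requests from given IP ranges."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                # Read-only actions; document uploads are not granted here.
                "Action": ["cloudsearch:search", "cloudsearch:suggest"],
                "Condition": {"IpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
            }
        ],
    }


# 192.0.2.0/24 is a documentation-only range used as a placeholder.
policy_json = json.dumps(build_search_policy(["192.0.2.0/24"]), indent=2)
```

Leaving cloudsearch:document out of the action list means clients matching this policy can query the domain but cannot add or delete documents.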
Lastly, on pricing, CloudSearch is slightly more expensive per hour for similarly sized instance types. This is a tricky comparison, however, since the instance types aren’t exactly an apples-to-apples comparison.
The differences between CloudSearch and ElasticSearch are subtle at first, but they become clearer on further investigation. Regardless of the differences, both are powerful managed tools for developers who need to let their users search and interact with large amounts of data. Hopefully we’ve uncovered enough to point you in the right direction to start successfully solving your search sorrows!