Allie Tarrant
Nov 01, 2018
It all started with a hackathon project just a few months ago when a team of engineers gathered at the 2018 Delphix Engineering Kickoff event. Our goal was to improve the diagnosability of Delphix as a product, a theme that the Product Management team had been exploring for some time.
In a nutshell, the idea was to empower our customers to notice problems in real time and to self-diagnose and solve those problems at their convenience. We wanted to ensure that all Delphix users have the information they need, when they need it, to make the best possible decisions.
This early hackathon effort ultimately grew to become a full-fledged engineering project, which was delivered in the latest Delphix 5.3 release as native support for integrating with Splunk, a third-party platform for collecting and analyzing data. I’m going to tell the story of how we got from there to here, starting with a closer look at what diagnosability means for Delphix.
While Delphix provides a lot of useful information, one of the biggest challenges is getting that information into a format that supports rich, customizable search and can be cross-referenced with diagnostics from non-Delphix infrastructure, which is often what it takes to track down a problem. Until recently, the easiest way to see most customer-visible information has been to log in to the Delphix Management UI manually.
This approach is usually sufficient for simple cases, but our UI can only offer the search and visualization tools that we’ve built into the product ourselves. Also, when information about Delphix is confined to the Delphix UI, it narrows the opportunity for collaboration with other tools. It’s feasible to retrieve information via email, syslog and our API/CLI, but these methods aren’t flexible and often require extra work and one-off scripted solutions to get the data into the right format.
This is exacerbated by the fact that Delphix is a closed appliance, so extending functionality and expanding reporting capabilities often require customers to collaborate with our wonderful support and services staff. This process generally leads to workable solutions, but it’s something that needs to be repeated for each customer individually, with variations based on their specific needs.
Enter Splunk. Splunk is a platform that lets you index, search and visualize almost any type of data imaginable. They promise that with Splunk, you can “work the way your data works to deliver limitless insights and meaningful outcomes.” Sounds promising, right?
The messaging seems to dovetail nicely with what we hoped to accomplish in improving the diagnosability of Delphix. Best of all, we found that a majority of our existing customers already use Splunk! So what if we could just plug into their existing tools and workflows by providing an easy way to integrate Delphix with Splunk? That would take our fuzzy, aspirational concept of achieving diagnosability and turn it into a concrete, bounded goal with measurable benefits for new and existing customers alike.
Our hackathon team members all brought different skills to the table with the goal of prototyping some sort of Splunk integration with Delphix. In true hackathon spirit, we came with a willingness to learn and explore since the specifics of the project were somewhat hazy to start with, and none of us had any real experience with Splunk yet.
Thinking back to those 24 hours, I mainly remember wrestling with badly documented Java logging configuration issues and wading through cryptic errors trying to install Illumos packages, all while wrapping our heads around Splunk concepts and terminology that were new to all of us.
But we persisted, and by the end we had kludged together a rough prototype of a Delphix Engine hooked up to our trial Splunk instance, sending out structured JSON events in real time as activity happened on the engine.
This was a great start and an exciting proof-of-concept, but we were still a far cry from delivering a customer-ready, shippable feature. Over the next few months, we assembled a team to take this idea, still in its infancy, and grow it into a real, live piece of software that could survive on its own out in the world.
This was a collaborative, cross-team effort. On the development side of things, we had people familiar with our Java stack (myself included), along with folks from the Systems Platform team with expertise in the OS-level changes required to support the Splunk integration effort (codenamed Delphix Insight). We worked with our DevOps organization to set up and maintain an internal Splunk Enterprise instance to develop against and brought QA engineers on board to write test automation, ensuring a high level of quality before the release.
Perhaps most importantly, we consulted with folks in the field to make the right decisions based on our customers’ needs throughout the development process and ensure we were building the right thing the right way.
Architectural Diagram
The end result was a carefully orchestrated set of moving parts that work together to accomplish our initial goal: compiling actionable data and shipping it where it needs to go as quickly and painlessly as possible, without requiring too much effort from customers. Now, instead of rolling their own custom solution or enlisting the help of our Support and Professional Services teams, our users have a single point of configuration to get the data they need, one that fits natively with the rest of the Delphix product and works out of the box.
Single Screen Splunk Configuration in 5.3.0.0
Beyond the benefits of simply getting the information out of Delphix, one of the most useful parts of this integration is that all of the data that we send to Splunk is structured according to well-defined JSON schemas. Unlike with simple text logs, each part of the event can be independently searched and filtered. It then becomes quite easy to use Splunk to create customizable searches, graphs, and visualizations of the data.
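To make that concrete, here's a sketch of what one of these structured events might look like. The schema and field names below are illustrative placeholders, not the actual Delphix event schema:

```json
{
  "_note": "illustrative example, not the real Delphix event schema",
  "event": "ACTION_COMPLETED",
  "timestamp": "2018-10-31T16:42:07.000Z",
  "action": {
    "title": "DB_SYNC",
    "user": "admin",
    "state": "COMPLETED",
    "durationMs": 85000
  }
}
```

Because each field is addressable on its own, Splunk can filter on `action.state`, chart `action.durationMs` over time, or group events by `action.user` without any regex gymnastics.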
Here’s a simple example of a short Splunk query that creates a table of the longest running actions:
Search:
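Something along these lines would do it; the sourcetype and field names below follow the illustrative event sketch above, and the exact names depend on how the events are actually indexed:

```
sourcetype=delphix-events event=ACTION_COMPLETED
| sort - action.durationMs
| head 10
| table action.title action.user action.durationMs
```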
Result:
Depending on which information customers care most about, these queries can be easily tweaked, expanded and combined with other important data to get at exactly what is needed for their specific use cases.
There were many other interesting design decisions that came up while developing this project, but I’ll highlight just one more. To get data from our Java stack to the customer’s Splunk instance, we decided to use Fluentd as an intermediary.
Fluentd is described as a “unified logging layer.” You can think of it as a lightweight forwarder that knows how to accept data from different sources and send it to a variety of destinations, sort of like a tiny post office for machine data. In our case, a Fluentd daemon runs on the Delphix Engine itself, accepts JSON input from the Java stack and uses the official Fluentd plugin published by Splunk to send it along to the customer’s Splunk Enterprise instance.
By using Fluentd, we got all sorts of benefits for free, such as on-disk file buffers to protect against data loss and automatic retries in case the connection to Splunk is spotty. This contributed to a robust architecture and saved us much effort compared to going directly from Java to Splunk.
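As a rough sketch of what such a pipeline looks like, here is a minimal Fluentd configuration. It assumes the Java stack emits events over Fluentd's forward protocol and uses Splunk's fluent-plugin-splunk-hec output plugin; the tag, hostname, token, and buffer path are all placeholders rather than our actual production settings:

```
# Accept JSON events from the local Java stack over the forward protocol.
# The tag "delphix.events" is an illustrative placeholder.
<source>
  @type forward
  bind 127.0.0.1
  port 24224
</source>

# Forward matching events to Splunk's HTTP Event Collector using the
# fluent-plugin-splunk-hec output plugin published by Splunk.
<match delphix.events>
  @type splunk_hec
  hec_host splunk.example.com   # customer's Splunk Enterprise instance
  hec_port 8088
  hec_token YOUR-HEC-TOKEN      # placeholder

  # The on-disk buffer and retry behavior mentioned above: events are
  # persisted locally and re-sent if the connection to Splunk is spotty.
  <buffer>
    @type file
    path /var/log/fluentd/buffer/splunk
    retry_type exponential_backoff
    flush_interval 5s
  </buffer>
</match>
```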
More interesting, though, is Fluentd’s ability to mix and match various plugins and the possibilities this architecture leaves open for the future. Using Fluentd from the start keeps our options open and lays a foundation for further integrations down the road.
Already, we’ve seen early benefits from this decision: other engineering teams at Delphix have been able to take advantage of our work as a starting point to rapidly prototype solutions in other areas. For example, we’ve seen promising efforts to integrate with Elasticsearch that could enable Delphix toolkit developers to add custom logging to toolkits. In turn, this would greatly improve the development and debugging process and make it easier and faster for folks outside of Delphix to write custom integrations, allowing new types of databases to be virtualized by Delphix.
What will we think of next? Stay tuned!