Application Performance Management (APM) is the monitoring and management of the performance and availability of software applications. What APM means can vary between people and businesses. The most basic and most important reason for monitoring your infrastructure and applications is achieving 100% uptime for your customers and stakeholders. Many tools have been built over time to help developers achieve that.

Different organizations use different tools depending on their requirements. With so many solutions available, it is tough to pick one, since each has its pros and cons. At Shadowfax, we have tried a few as well. As our application traffic increased over time, we wanted to set up more detailed alerts, for example when the error count of our APIs or the average response time of our tasks rises above a certain threshold.

A great start with New Relic

During our early days, we focused on building features for our customers and for internal processes, and decided to start with New Relic APM Lite as our monitoring tool. It helped us monitor the performance of our complete application, and we fixed a lot of issues to improve our overall response times. On the trial account, we were allowed to monitor our whole application, i.e. both web and non-web components.

As our application grew and our trial period with New Relic ended, we started to miss a lot of insights. We had no way to keep track of our servers, and our production servers would go down without our knowledge. Production issues were mostly reported by our on-ground team when their applications stopped working. Even tracking disk usage was hard, and that resulted in downtime multiple times.

Going back to basics, we set up parsing of our server logs and inspected them each time something went wrong. The usual problems that occurred but went unreported included the following:

Failures in computational resources: rising CPU utilization, memory issues, disk usage reaching 100%, etc.
RabbitMQ queues unable to receive or delegate tasks
A database query taking too long, clogging the connection pool and bringing the application down
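
As an illustration only, this kind of log inspection could look like the sketch below. It assumes Nginx's default "combined" access log format; the function name, the sample lines, and the choice to count 5xx responses per path are hypothetical and are not taken from our actual scripts.

```python
import re
from collections import Counter

# Matches the request line and status code of Nginx's default "combined" format:
# ... "METHOD /path HTTP/x.y" STATUS BYTES "referer" "user-agent"
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_server_errors(lines):
    """Return a Counter mapping request path to the number of 5xx responses."""
    errors = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            errors[match.group("path")] += 1
    return errors


# Hypothetical sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2000:10:00:01 +0000] "GET /api/orders HTTP/1.1" 200 512 "-" "curl/7.58"',
    '10.0.0.2 - - [01/Jan/2000:10:00:02 +0000] "POST /api/orders HTTP/1.1" 502 173 "-" "okhttp/3.9"',
    '10.0.0.3 - - [01/Jan/2000:10:00:03 +0000] "POST /api/orders HTTP/1.1" 504 173 "-" "okhttp/3.9"',
]

print(count_server_errors(sample).most_common())
```

A script like this only tells you what went wrong after the fact, which is why the problems above kept going unreported until someone looked.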

Visualization and Debugging with ELK

As our infrastructure grew, we started using ELK (Elasticsearch, Logstash, Kibana) for debugging production issues. We moved to a central RabbitMQ, centralized our Celery nodes, and created dashboards for Nginx logs, MQTT stats, and team-related metrics. At this point we were using New Relic APM Lite and Nagios alongside ELK.

What changed:

Great visualizations and dashboards
A central logging system to debug production issues
We stopped using Flower for monitoring Celery workers

Gaps that remained

X-Pack did not offer alerting or authentication on the Basic or open source plans
Multiple monitoring tools (Nagios, ELK, Sentry, New Relic APM Lite, Flower) are hard to maintain
Gathering data from multiple tools to debug a production issue is tedious

With multiple monitoring tools to maintain, we wanted to upgrade our New Relic subscription plan and stop worrying. However, the pricing stopped us from doing so, and we decided to look for an open source solution.

Beyond the doors of open source

After a little trial and error, we decided to use Graphite with StatsD and collectd. With many collectd plugins already available and StatsD easy to integrate with Django, the transition was very easy. We used collectd to gather server metrics, with plugins such as collectd-rabbitmq and redis-collectd-plugin. To visualize our time series data for application monitoring and analytics, we used Grafana, which has a better visualization component than Graphite.
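
To give a feel for what sending a metric to StatsD involves, here is a minimal sketch that uses only the Python standard library. It relies on the plain-text StatsD line format (`name:value|type`, with `c` for counters, `ms` for timers and `g` for gauges) sent over UDP to the default port 8125. The helper names, the metric names, and the environment prefix are assumptions made for illustration; in a Django project you would more likely use an existing StatsD client library.

```python
import socket


def format_metric(env, name, value, metric_type):
    """Build one StatsD datagram, prefixed with the environment name.

    metric_type is "c" (counter), "ms" (timer) or "g" (gauge).
    """
    return f"{env}.{name}:{value}|{metric_type}"


def send_metric(payload, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; 8125 is the default StatsD port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
error_count = format_metric("staging", "api.orders.errors", 1, "c")
response_time = format_metric("staging", "api.orders.response_time", 230, "ms")
print(error_count)
print(response_time)
```

Calling `send_metric(error_count)` hands the datagram to a StatsD daemon, which aggregates the values and flushes them to Graphite. Prefixing every name with the environment is one simple way to keep production, staging and demo metrics in separate trees.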

We also added authentication to ELK and switched to a self-hosted version of Sentry. What we achieved:

One tool to monitor our complete infrastructure
A better context of our systems
Visualizing aggregated data over time helps us decide how to tweak our stack
Developers can add their own metrics for monitoring as needed
Because the tools are easy to use, individuals across teams were able to create dashboards and alerting mechanisms for their own use cases
Confidence in decisions on scaling infrastructure and troubleshooting
Customizing the different components used for monitoring as per our need
Clear separation of metrics from our production, staging and demo environments
Cost effectiveness
It was fun to set up our own thing

What’s next

There is always a lot to refine, and over time we will move towards clustering Graphite to handle more data. We also plan to stop using Elasticsearch as a Grafana data source, since alerting on it is still not available in the current version of Grafana.

How we moved to open source alternative for Application Performance Management?
