Application Performance Management (APM) is the monitoring and management of the performance and availability of software applications. The interpretation of APM varies across people and businesses. The most basic and most important reason for monitoring your infrastructure and applications is to achieve 100% uptime for your customers and stakeholders. Many tools have been built over time to help developers achieve this.
For more on application performance management, visit.
Different organizations use different tools depending on their requirements. With so many solutions available, it is tough to pick one, since each has its pros and cons. At Shadowfax, we have tried a few as well. As our application traffic increased over time, we wanted to set up more detailed alerts, for example when the error count of our APIs or the average response time of our tasks exceeds a certain threshold.
A great start with New Relic
During our early days, we focused on building features for our customers and internal processes, and decided to start with New Relic APM Lite as our monitoring tool. It helped us monitor our complete application performance, and a lot of issues were fixed to improve our overall response times. On the trial account, we were allowed to monitor our whole application, i.e., both web and non-web components.
As our application grew and our trial period with New Relic ended, we started to miss a lot of insights. We had no way to keep track of our servers, and our production servers would go down without our knowledge. Production issues were mostly reported by our on-ground team when their applications stopped working. Even tracking disk usage was hard, and running out of disk caused downtime multiple times.
Going back to basics, we set up parsing of our server logs and inspected them each time something bad happened. Some of the usual problems that happened but went unreported included the following:
- Failures in computational resources: rising CPU utilization, memory issues, disk usage climbing to 100%, etc.
- RabbitMQ Queue unable to receive or delegate tasks
- A database query taking too long and clogging the connection pool, bringing the application down
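Disk usage is the easiest of these failures to catch early. As a minimal sketch (not the check we actually ran), a small script like the following could be scheduled on each server. The 90% threshold and the function names are assumptions chosen for illustration; only the Python standard library is used.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of space used on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def disk_alert(percent_used, threshold=90.0):
    """Return a warning message if usage is at or above `threshold`, else None.

    The 90% default is an illustrative assumption, not a recommendation.
    """
    if percent_used >= threshold:
        return (
            f"WARNING: disk usage at {percent_used:.1f}% "
            f"(threshold {threshold:.1f}%)"
        )
    return None


if __name__ == "__main__":
    message = disk_alert(disk_usage_percent("/"))
    if message:
        print(message)
```

Run from cron or a similar scheduler, the printed warning could be forwarded to whatever notification channel the team already uses.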
Visualization and Debugging with ELK
As our infrastructure grew, we started using ELK (Elasticsearch, Logstash, Kibana) for debugging production issues. We moved to a central RabbitMQ, centralized our Celery nodes, and created dashboards to monitor Nginx logs, MQTT stats, and team-related metrics. At this point we were using New Relic APM Lite and Nagios alongside ELK.
- Great Visualizations and Dashboards
- Central Logging system to debug production issues
- We stopped using Flower for monitoring celery workers
Gaps that remained
- X-Pack did not offer alerting or authentication on the Basic or open source plans
- Using multiple monitoring tools (Nagios, ELK, Sentry, New Relic APM Lite, Flower) is hard to maintain
- Gathering data from multiple tools while debugging a production issue is tedious
With multiple monitoring tools to maintain, we wanted to upgrade our New Relic subscription plan and stop worrying. However, the pricing stopped us from doing so, and we decided to find an open source solution.
Beyond the doors of open source
After a little trial and error, we decided to use Graphite with StatsD and collectd. With many collectd plugins already available and StatsD easy to integrate with Django, the transition was very smooth. We used collectd to gather server metrics, with plugins like collectd-rabbitmq and redis-collectd-plugin. To visualize our time series data for application and analytics, we used Grafana, which has better visualization components than Graphite.
We also added authentication to ELK and used a self-hosted version of Sentry.
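To show how little an application needs in order to emit a metric, here is a minimal sketch of the StatsD plain-text line format sent over UDP, using only the Python standard library. It is not our Django integration: the address `127.0.0.1:8125`, the metric names, and the helper functions are assumptions for illustration, and in practice a StatsD client library would usually replace these helpers.

```python
import socket

# Assumed address of a local StatsD daemon; 8125 is the conventional port.
STATSD_ADDR = ("127.0.0.1", 8125)


def format_metric(name, value, metric_type):
    """Build one StatsD line: <name>:<value>|<type>.

    Common types are "c" (counter), "ms" (timer) and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(line, addr=STATSD_ADDR):
    """Send one metric line over UDP (fire and forget, no delivery guarantee)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), addr)
    finally:
        sock.close()


def incr(name, count=1):
    """Increment a counter, e.g. the number of API errors."""
    send_metric(format_metric(name, count, "c"))


def timing(name, milliseconds):
    """Record a duration in milliseconds, e.g. a task's response time."""
    send_metric(format_metric(name, milliseconds, "ms"))


if __name__ == "__main__":
    incr("api.orders.errors")               # hypothetical metric name
    timing("tasks.assign_rider.time", 320)  # hypothetical metric name
```

Because the transport is UDP, a missing or overloaded StatsD daemon does not slow down or break the request that emits the metric; the trade-off is that individual data points can be lost silently.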
What we achieved:
- One tool to monitor our complete infrastructure
- A better context of our systems
- Visualizing aggregated data over time helps our decisions in tweaking our stack
- Enabling developers to add metrics for monitoring as per their need
- Thanks to the ease of use, individuals across teams were able to create dashboards and alerting mechanisms for their own use cases
- Confidence in decisions on scaling infrastructure and troubleshooting
- Customizing the different components used for monitoring as per our need
- Clear separation of metrics from our production, staging and demo environments
- Cost effectiveness
- It was fun to set up our own thing
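The separation of environments mentioned in the list above can be expressed in the metric name itself, since Graphite organizes metrics as dot-delimited paths. The sketch below is one possible naming helper, not our actual scheme: the function name, the character whitelist, and the example metric names are assumptions.

```python
import re

# Environments named in this post; anything else is rejected.
ENVIRONMENTS = {"production", "staging", "demo"}


def metric_path(environment, *parts):
    """Build a dot-delimited metric path prefixed with the environment.

    Characters outside letters, digits, "_" and "-" are replaced with "_"
    so that a stray dot or space cannot create unintended path levels.
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    cleaned = [re.sub(r"[^A-Za-z0-9_-]", "_", part) for part in parts]
    return ".".join([environment, *cleaned])


if __name__ == "__main__":
    print(metric_path("staging", "api", "orders", "response_time"))
    print(metric_path("production", "api", "GET /orders"))
```

With the environment as the first path segment, a dashboard can be pointed at a different environment by changing only that prefix in its queries.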
There is always a lot to refine, and over time we will move towards clustering Graphite to handle more data. We plan to stop using Elasticsearch as a data source, as alerting is still not available for it in the current version of Grafana.
Shadowfax is India’s largest crowdsourced delivery platform, with a presence in 70+ cities across India and 7000+ daily active delivery personnel. Shadowfax’s unique app enables delivery of food, grocery, pharmacy and e-commerce orders for businesses and helps them create customer delight using technology. With a relentless focus on engineering pleasant experiences for customers, Shadowfax aims to become the most desirable and trustworthy delivery platform for customers.