This is the fifth post in our Data Visualization Spotlight series, where we showcase how different organizations are using data visualization and analytics to solve their day-to-day problems.
Known as “the SMS of the Internet”, this 140-character online social networking and microblogging service revolutionized the way we connect with people online. As of September 2013, the company’s data showed that 200 million users send over 400 million tweets daily.
At Twitter, engineers deal with massive data sets daily. To analyze them, they create complex workflows using a variety of tools and languages, including Pig and Scalding. One difficulty many of them face when using these tools is visibility—when a Pig script is executed, multiple MapReduce jobs might be launched, either in parallel or serially when one job depends on the output of another. As these jobs run, the status of individual jobs can be monitored with the Hadoop JobTracker UI, but the overall progress of the script can be difficult to monitor.
Ambrose, visualizing and monitoring large scale data workflows
Ambrose was born at one of Twitter’s quarterly Hack Weeks. Its creators, Bill Graham and Andy Schlaikjer, wanted a platform that would allow visualization and real-time monitoring of large-scale data workflows. Ambrose presents a global view of all the MapReduce jobs derived from a workflow after planning and optimization. As jobs are submitted for execution on the Hadoop cluster, Ambrose updates its visualization to reflect the latest job status. Ambrose provides the following in a web UI:
- A workflow progress bar depicting percent completion of the entire workflow
- A table view of all workflow jobs, along with their current state
- A graph diagram which depicts job dependencies and metrics
- Visual weighting of jobs based on resource consumption
- Visual weighting of job dependencies based on data volume
- Script view with line highlighting
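To make the ideas behind the first three features concrete—a DAG of dependent jobs, per-job status, and an overall progress figure—here is a minimal, hypothetical sketch in Python. None of these names come from Ambrose's actual code; it simply models a workflow as jobs with dependencies, aggregates per-job progress into a workflow-level percentage (here, an equal-weight average), and identifies which jobs can run in parallel.

```python
# Hypothetical model of a MapReduce workflow DAG with an overall
# progress metric, in the spirit of Ambrose's workflow progress bar.
# All names and the aggregation scheme are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    progress: float = 0.0                      # 0.0 .. 1.0, reported per job
    deps: list = field(default_factory=list)   # names of upstream jobs

def workflow_progress(jobs):
    """Percent completion of the whole workflow, taken here as the
    equal-weight mean of per-job progress values."""
    if not jobs:
        return 0.0
    return 100.0 * sum(j.progress for j in jobs) / len(jobs)

def runnable(jobs):
    """Jobs whose dependencies have all finished may run in parallel;
    a job waiting on another's output must run serially after it."""
    done = {j.name for j in jobs if j.progress >= 1.0}
    return [j for j in jobs if j.progress < 1.0
            and all(d in done for d in j.deps)]

jobs = [
    Job("load_and_filter", progress=1.0),
    Job("join_users", progress=0.5, deps=["load_and_filter"]),
    Job("aggregate", progress=0.0, deps=["join_users"]),
]
print(workflow_progress(jobs))           # 50.0
print([j.name for j in runnable(jobs)])  # ['join_users']
```

In a real monitor like Ambrose, the per-job progress values would be polled from the Hadoop cluster rather than set by hand, and jobs could be weighted by resource consumption rather than equally.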