
How Twitter uses data visualization to track its complex workflows

This is the fifth post in our Data Visualization Spotlight series, where we showcase how different organizations use data visualization and analytics to solve their day-to-day problems.

Twitter, known as “the SMS of the Internet”, is the 140-character social networking and microblogging service that revolutionized the way we connect with people online. As of September 2013, the company’s data showed that 200 million users send over 400 million tweets daily.

With that much activity, Twitter has to deal with massive data sets every day. To analyze them, its engineers create complex workflows using a variety of tools and languages, including Pig and Scalding. One difficulty many of them face when using these tools is visibility: when a Pig script is executed, multiple MapReduce jobs might be launched, either in parallel or serially if one job depends on the output of another. As these jobs run, the status of each individual job can be monitored with the Hadoop Job Tracker UI, but the overall progress of the script is difficult to follow.
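To make the parallel-versus-serial point concrete, here is a minimal Python sketch that groups the jobs of a workflow into "waves" that could run side by side. The job names and the dependency map are hypothetical and made up for illustration; this is not how Pig plans its jobs, only a way to picture the dependency structure described above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_waves(dependencies):
    """Group jobs into waves: every job in a wave has all of its inputs ready."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs that could run in parallel
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical workflow: each job maps to the set of jobs it depends on.
deps = {
    "load_users": set(),
    "load_tweets": set(),
    "join": {"load_users", "load_tweets"},
    "aggregate": {"join"},
}

waves = execution_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this example the two load jobs can run in parallel, while the join and the aggregate must each wait for the wave before them.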

Ambrose: visualizing and monitoring large-scale data workflows

Ambrose was born at one of Twitter’s quarterly Hack Weeks. Its creators, Bill Graham and Andy Schlaikjer, wanted a platform for visualizing and monitoring large-scale data workflows in real time. Ambrose presents a global view of all the MapReduce jobs derived from a workflow after planning and optimization. As jobs are submitted for execution on the Hadoop cluster, Ambrose updates its visualization to reflect the latest job status. Ambrose provides the following in a web UI:
  • A workflow progress bar depicting percent completion of the entire workflow
  • A table view of all workflow jobs, along with their current state
  • A graph diagram which depicts job dependencies and metrics
    1. Visual weighting of jobs based on resource consumption
    2. Visual weighting of job dependencies based on data volume
  • Script view with line highlighting
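As a rough illustration of the first two items, the Python sketch below models a table of jobs and derives a single percent-complete figure for the workflow. The `Job` fields, the state names, and the averaging rule are all assumptions made for this example; the post does not describe how Ambrose actually computes its progress bar.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One row of a hypothetical job table."""
    job_id: str
    state: str  # assumed states: "PENDING", "RUNNING", "COMPLETE", "FAILED"
    map_progress: float = 0.0     # fraction between 0.0 and 1.0
    reduce_progress: float = 0.0  # fraction between 0.0 and 1.0


def job_progress(job):
    """Return a job's completion as a fraction (assumed rule, not Ambrose's)."""
    if job.state == "COMPLETE":
        return 1.0
    if job.state == "RUNNING":
        return (job.map_progress + job.reduce_progress) / 2
    return 0.0  # pending or failed jobs contribute nothing


def workflow_percent(jobs):
    """Average the per-job fractions into one workflow percentage."""
    if not jobs:
        return 0.0
    return 100.0 * sum(job_progress(job) for job in jobs) / len(jobs)


jobs = [
    Job("job_1", "COMPLETE", 1.0, 1.0),
    Job("job_2", "RUNNING", 0.5, 0.5),
    Job("job_3", "PENDING"),
]
print(f"workflow is {workflow_percent(jobs):.0f}% complete")
```

Weighting every job equally is the simplest choice; a real tool might weight jobs by input size or expected run time instead.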
Fig: In this screenshot, we see the Ambrose UI for a workflow compiled from a single Pig script. The circular chord diagram in the upper left highlights dependencies between jobs. As a job’s status changes, the color of its arc in the diagram changes. Statistics for the most recently started job are displayed to the right of the chord diagram. Summary information and the status of all jobs are displayed in the table beneath these two views. Image Source: blog.twitter.com

Fig: With Ambrose, the real-time status of a complex series of MapReduce jobs can be visualized succinctly, so that users can quickly understand how far computation has progressed and diagnose failures in context. Image Source: github.com/twitter/ambrose

The interface presents multiple responsive “views” of a single workflow. Just beneath the toolbar at the top of the window is a workflow progress bar that tracks overall completion of the workflow. Below the progress bar is a graph diagram that depicts the workflow’s jobs and their dependencies. Below the graph diagram is a table of workflow jobs.

All views react to mouse-over and click events on a job, regardless of the view in which the event is triggered. Moving your mouse over the first row of the table highlights that job’s table row along with the job’s node in the graph diagram. Clicking on a job in any view selects it and updates the highlighting of that job in all views. Clicking the same job again deselects it.
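The linked-view behavior described above can be pictured as one shared selection state that every view listens to. The following Python sketch is a hypothetical model of that idea, with invented class and variable names; it is not taken from Ambrose's code.

```python
from typing import Callable, List, Optional


class JobSelection:
    """Shared selection state observed by every view (hypothetical model)."""

    def __init__(self) -> None:
        self.selected: Optional[str] = None
        self._listeners: List[Callable[[Optional[str]], None]] = []

    def subscribe(self, listener: Callable[[Optional[str]], None]) -> None:
        """Register a view callback that receives the current selection."""
        self._listeners.append(listener)

    def click(self, job_id: str) -> None:
        """Select a job, or deselect it if it is already selected."""
        self.selected = None if self.selected == job_id else job_id
        for listener in self._listeners:
            listener(self.selected)


# Two stand-in "views" that simply record what they were asked to highlight.
table_highlight = []
graph_highlight = []

selection = JobSelection()
selection.subscribe(table_highlight.append)
selection.subscribe(graph_highlight.append)

selection.click("job_1")  # selects job_1 in both views
selection.click("job_1")  # second click deselects it everywhere
```

Because the views never talk to each other directly, a click in the table and a click in the graph go through the same path, which is what keeps the highlighting consistent.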

Because sharing is caring: going open source

Image Source: blog.twitter.com

At the Apache Pig Hackathon held in May 2012, Twitter open-sourced Ambrose. The initial open-source release worked only with Pig; contributions from the open-source community later added support for other runtimes such as Hive, Cascading and Scalding.

Fig: The open-sourced version also included a graph layout of Pig EXPLAIN data. This visualization can be used to debug and better understand Pig scripts. Image Source: Hortonworks

Final thoughts

Comprehensive visibility is the first step to managing complex workflows, and Ambrose, Twitter’s data visualization tool, provides that visibility into jobs. By supplying the right context, it makes it easier to plan your jobs properly, monitor their progress and diagnose failures in time. In the next post of the Data Visualization Spotlight series, read how Airbnb used conditional probability models and data visualization to make its search algorithm more location relevant.

Shilpi Choudhury
