This two-part series of articles illustrates how to build applications using Amazon Web Services by describing how Alexa's GrepTheWeb service was built.
By Jinesh Varia, Amazon Web Services
Cloud Architectures are designs of software applications that use Internet-accessible on-demand services. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to process a user request), draw the necessary resources on demand (like compute servers or storage), perform a specific job, then relinquish the unneeded resources and often dispose of themselves after the job is done. While in operation, the application scales up or down elastically based on resource needs.
In the first section, we describe an example of an application that is currently in production using the on-demand infrastructure provided by Amazon Web Services. This application allows a developer to do pattern-matching across millions of web documents. The application brings up hundreds of virtual servers on-demand, runs a parallel computation on them using an open source distributed processing framework called Hadoop, then shuts down all the virtual servers releasing all its resources back to the cloud—all with low programming effort and at a very reasonable cost for the caller.
In the second section, we discuss some best practices for using each Amazon Web Service - Amazon S3, Amazon SQS, Amazon SimpleDB and Amazon EC2 - to build an industrial-strength scalable application.
Cloud Architectures address key difficulties surrounding large-scale data processing. First, in traditional data processing it is difficult to get as many machines as an application needs. Second, it is difficult to get the machines when one needs them. Third, it is difficult to distribute and co-ordinate a large-scale job on different machines, run processes on them, and provision another machine to recover if one machine fails. Fourth, it is difficult to auto-scale up and down based on dynamic workloads. Fifth, it is difficult to get rid of all those machines when the job is done. Cloud Architectures solve such difficulties.
Applications built on Cloud Architectures run in-the-cloud where the physical location of the infrastructure is determined by the provider. They take advantage of simple APIs of Internet-accessible services that scale on-demand, that are industrial-strength, where the complex reliability and scalability logic of the underlying services remains implemented and hidden inside-the-cloud. The usage of resources in Cloud Architectures is as needed, sometimes ephemeral or seasonal, thereby providing the highest utilization and optimum bang for the buck.
There are some clear business benefits to building applications using Cloud Architectures. A few of these are listed here:
There are plenty of examples of applications that could utilize the power of Cloud Architectures. These range from back-office bulk processing systems to web applications. Some are listed below:
In this paper, we will discuss one application example in detail, code-named “GrepTheWeb”.
The Alexa Web Search web service allows developers to build customized search engines against the massive data that Alexa crawls every night. One of the features of their web service allows users to query the Alexa search index and get Million Search Results (MSR) back as output. Developers can run queries that return up to 10 million results.
The resulting set, which represents a small subset of all the documents on the web, can then be processed further using a regular expression language. This allows developers to filter their search results using criteria that are not indexed by Alexa (Alexa indexes documents based on fifty different document attributes) thereby giving the developer power to do more sophisticated searches. Developers can run regular expressions against the actual documents, even when there are millions of them, to search for patterns and retrieve the subset of documents that matched that regular expression.
This application is currently in production at Amazon.com and is code-named GrepTheWeb because it can “grep” (after the popular Unix command-line utility for searching patterns) the actual web documents. GrepTheWeb allows developers to do some pretty specialized searches, like selecting documents that have a particular HTML tag or META tag, finding documents with particular punctuation (“Hey!”, he said. “Why Wait?”), or searching for mathematical equations (“f(x) = ∑x + W”), source code, e-mail addresses, or other patterns such as “(dis)integration of life”.
While the functionality is impressive, for us the way it was built is even more so. In the next section, we will zoom in to see different levels of the architecture of GrepTheWeb.
Figure 1 shows a high-level depiction of the architecture. The output of the Million Search Results Service, which is a sorted list of links and gzipped (compressed using the Unix gzip utility) in a single file, is given to GrepTheWeb as input. It takes a regular expression as a second input. It then returns a filtered subset of document links sorted and gzipped into a single file. Since the overall process is asynchronous, developers can get the status of their jobs by calling GetStatus() to see whether the execution is completed.
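The asynchronous request-then-poll interaction described above can be sketched as follows. This is only an illustration: `FakeGrepService`, `start_grep`, `get_status`, `wait_for_job`, the job id format, and the input file name are hypothetical stand-ins, not the actual service interface.

```python
import itertools
import time


class FakeGrepService:
    """In-memory stand-in for the service endpoint (hypothetical, for illustration)."""

    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._remaining = {}
        self._polls_until_done = polls_until_done

    def start_grep(self, input_location, regex):
        """Accept a job and return immediately with a job id."""
        job_id = f"job-{next(self._ids)}"
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def get_status(self, job_id):
        """Report 'running' until the simulated work is finished."""
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return "running"
        return "completed"


def wait_for_job(service, job_id, poll_interval=0.0, max_polls=100):
    """Poll get_status until the job completes or max_polls is reached."""
    for _ in range(max_polls):
        status = service.get_status(job_id)
        if status == "completed":
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


service = FakeGrepService()
job = service.start_grep("msr-output.gz", r"A(.*)zon")  # assumed input name
final_status = wait_for_job(service, job)
```

Because the caller gets a job id back immediately, it is free to do other work between polls; a real client would use a non-zero `poll_interval`.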
Performing a regular expression against millions of documents is not trivial. Different factors could combine to cause the processing to take a lot of time:
Hence, the design goals of GrepTheWeb included scaling in all dimensions (more powerful pattern-matching languages, more concurrent users of common datasets, larger datasets, better result quality) while keeping the costs of processing down.
The approach was to build an application that scales with demand, without a heavy upfront investment and without the cost of maintaining idle machines (“downbottom”). To get a response in a reasonable amount of time, it was important to distribute the job into multiple tasks and to perform a Distributed Grep operation that runs those tasks on multiple nodes in parallel.
Figure 1 : GrepTheWeb Architecture - Zoom Level 1
![[image]](http://mowser.com/img?url=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fservlet%2FKbServlet%2FdownloadImage%2F1632-102-184%2Ffigure1.png)
Figure 2: GrepTheWeb Architecture - Zoom Level 2
![[image]](http://mowser.com/img?url=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fservlet%2FKbServlet%2FdownloadImage%2F1632-102-185%2Ffigure2.png)
Zooming in further, the GrepTheWeb architecture looks as shown in Figure 2 (above). It uses the following AWS components:
GrepTheWeb is modular. It does its processing in four phases, as shown in Figure 3. The launch phase is responsible for validating and initiating the processing of a GrepTheWeb request, instantiating Amazon EC2 instances, launching the Hadoop cluster on them and starting all the job processes. The monitor phase is responsible for monitoring the EC2 cluster and the map and reduce tasks, and for checking for success and failure. The shutdown phase is responsible for billing and shutting down all Hadoop processes and Amazon EC2 instances, while the cleanup phase deletes Amazon SimpleDB transient data.
Figure 3: Phases of GrepTheWeb Architecture
![[image]](http://mowser.com/img?url=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fservlet%2FKbServlet%2FdownloadImage%2F1632-102-186%2Ffigure3.png)
Figure 4: GrepTheWeb Architecture - Zoom Level 3
![[image]](http://mowser.com/img?url=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fservlet%2FKbServlet%2FdownloadImage%2F1632-102-187%2Ffigure4.png)
Detailed Workflow for Figure 4:
Users can execute GetStatus on the service endpoint to get the status of the overall system (all controllers and Hadoop) and download the filtered results from Amazon S3 after completion.
In the next four subsections we present rationales of use and describe how GrepTheWeb uses AWS services.
In GrepTheWeb, Amazon S3 acts as an input as well as an output data store. The input to GrepTheWeb is the web itself (compressed form of Alexa’s Web Crawl), stored on Amazon S3 as objects and updated frequently. Because the web crawl dataset can be huge (usually in terabytes) and always growing, there was a need for a distributed, bottomless persistent storage. Amazon S3 proved to be a perfect fit.
Amazon SQS was used as the message-passing mechanism between components. It acts as the “glue” that wires different functional components together. This not only helped in making the different components loosely coupled, but also helped in building an overall more failure-resilient system.
If one component is receiving and processing requests faster than other components (an unbalanced producer-consumer situation), buffering will help make the overall system more resilient to bursts of traffic (or load). Amazon SQS acts as a transient buffer between two components (controllers) of the GrepTheWeb system. If a message is sent directly to a component, the receiver will need to consume it at a rate dictated by the sender. For example, if the billing system was slow or if the launch time of the Hadoop cluster was more than expected, the overall system would slow down, as it would just have to wait. With message queues, sender and receiver are decoupled and the queue service smooths out any “spiky” message traffic.
Interaction between any two controllers in GrepTheWeb is through messages in the queue, and no controller directly calls any other controller. All communication and interaction happens by storing messages in the queue (en-queue) and retrieving messages from the queue (de-queue). This makes the entire system loosely coupled and the interfaces simple and clean. Amazon SQS provided a uniform way of transferring information between the different application components. Each controller's function is to retrieve a message, process it (execute the function) and store a message in the next queue, while remaining completely isolated from the other controllers.
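The retrieve, process, store pattern can be sketched with local queues. This is a minimal illustration that uses Python's in-process `queue.Queue` as a stand-in for Amazon SQS; the message fields, the `run_controller` helper, and the per-phase work functions are assumptions made for the example.

```python
import queue

# Local stand-ins for the queues that sit between phases.
launch_q = queue.Queue()
monitor_q = queue.Queue()
shutdown_q = queue.Queue()


def run_controller(name, inbox, outbox, work):
    """Dequeue one message, process it, and enqueue the result for the next phase."""
    message = inbox.get()               # de-queue
    updated = dict(message)
    updated.update(work(message))       # process (execute the function)
    updated["handled_by"] = message.get("handled_by", []) + [name]
    outbox.put(updated)                 # en-queue for the next controller


# Controllers never call each other; they only share queues.
launch_q.put({"job_id": "job-1", "regex": "A(.*)zon"})
run_controller("launch", launch_q, monitor_q, lambda m: {"cluster": "started"})
run_controller("monitor", monitor_q, shutdown_q, lambda m: {"hadoop": "completed"})
final_message = shutdown_q.get()
```

Each controller knows only its inbox and outbox, so a slow or restarted phase delays messages in its own queue without blocking the others.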
As it was difficult to know how much time each phase would take to execute (e.g., the launch phase decides dynamically how many instances need to start based on the request, and hence its execution time is unknown), Amazon SQS helped in building asynchronous systems. Now, if the launch phase takes more time to process or the monitor phase fails, the other components of the system are not affected and the overall system is more stable and highly available.
One use for a database in Cloud Architectures is to track statuses. Since the components of the system are asynchronous, there is a need to obtain the status of the system at any given point in time. Moreover, since all components are autonomous and discrete there is a need for a query-able datastore that captures the state of the system.
Because Amazon SimpleDB is schema-less, there is no need to define the structure of a record beforehand. Every controller can define its own structure and append data to a “job” item. For example, for a given job, “run email address regex over 10 million documents”, the launch controller will add/update the “launch_status” attribute along with the “launch_starttime”, while the monitor controller will add/update the “monitor_status” and “hadoop_status” attributes with enumeration values (running, completed, error, none). A GetStatus() call will query Amazon SimpleDB and return the state of each controller and also the overall status of the system.
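A minimal sketch of this status-tracking idea follows, with a plain dictionary standing in for an Amazon SimpleDB domain. The `put_attributes` and `get_status` helpers, the timestamp value, and the rule for deriving the overall status are assumptions for illustration, not the production logic.

```python
# A plain dict stands in for a SimpleDB domain: item name -> attributes.
domain = {}


def put_attributes(item_name, **attributes):
    """Add or update attributes on an item; no schema is declared up front."""
    domain.setdefault(item_name, {}).update(attributes)


def get_status(item_name):
    """Return each controller's state plus a derived overall status."""
    item = domain.get(item_name, {})
    states = {k: v for k, v in item.items() if k.endswith("_status")}
    # Assumed aggregation rule: any error wins, all completed means done.
    if "error" in states.values():
        overall = "error"
    elif states and all(v == "completed" for v in states.values()):
        overall = "completed"
    else:
        overall = "running"
    return {"controllers": states, "overall": overall}


# Each controller writes only its own attributes to the shared job item.
put_attributes("job-1", launch_status="completed",
               launch_starttime="2008-01-08T07:24:42Z")  # illustrative value
put_attributes("job-1", monitor_status="running", hadoop_status="running")
status = get_status("job-1")
```

Because every controller writes independently, a status query never has to contact a controller directly.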
Component services can query Amazon SimpleDB at any time because controllers store their states independently, which is one more way to create asynchronous, highly available services. Although a simplistic approach was used in implementing Amazon SimpleDB in GrepTheWeb, a more sophisticated approach with complete, almost real-time monitoring would also be possible, for example storing the Hadoop JobTracker status to show how many maps have been performed at a given moment.
Amazon SimpleDB is also used to store active Request IDs for historical and auditing/billing purposes.
In summary, Amazon SimpleDB is used as a status database to store the different states of the components and a historical/log database for querying high performance data.
In GrepTheWeb, all the controller code runs on Amazon EC2 instances. The launch controller spawns master and slave instances using a pre-configured Amazon Machine Image (AMI). Since dynamic provisioning and decommissioning happen through simple web service calls, GrepTheWeb knows how many master and slave instances need to be launched.
The launch controller makes an educated guess, based on reservation logic, of how many slaves are needed to perform a particular job. The reservation logic is based on the complexity of the query (number of predicates etc) and the size of the input dataset (number of documents to be searched). This was also kept configurable so that we can reduce the processing time by simply specifying the number of instances to launch.
After the instances are launched and the Hadoop cluster is started on them, Hadoop appoints a master and slaves, handles the negotiating, handshaking and file distribution (SSH keys, certificates), and runs the grep job.
Hadoop is an open source distributed processing framework that allows computation of large datasets by splitting the dataset into manageable chunks, spreading it across a fleet of machines and managing the overall process by launching jobs, processing the job no matter where the data is physically located and, at the end, aggregating the job output into a final result.
It typically works in three phases. A map phase transforms the input into an intermediate representation of key value pairs, a combine phase (handled by Hadoop itself) combines and sorts by the keys and a reduce phase recombines the intermediate representation into the final output. Developers implement two interfaces, Mapper and Reducer, while Hadoop takes care of all the distributed processing (automatic parallelization, job scheduling, job monitoring, and result aggregation).
In Hadoop, there is a master process running on one node to oversee a pool of slave processes (also called workers) running on separate nodes. Hadoop splits the input into chunks. These chunks are assigned to slaves; each slave performs the map task (logic specified by the user) on each pair found in the chunk, writes the results locally and informs the master of the completed status. Hadoop combines all the results and sorts them by key. The master then assigns keys to the reducers. Each reducer pulls the results using an iterator, runs the reduce task (logic specified by the user), and sends the “final” output back to the distributed file system.
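The three phases can be illustrated with a single-process sketch. This is not Hadoop code: the function names are invented for the example, and a word count is used as the sample job only because it is compact.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user's mapper to every record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def combine_phase(pairs):
    """Group values by key and sort by key (the framework's job, not the user's)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped, reducer):
    """Apply the user's reducer to each key and its list of values."""
    return [reducer(key, values) for key, values in grouped]


# User-supplied logic: the only two pieces a developer writes.
def word_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(word, counts):
    return (word, sum(counts))


lines = ["grep the web", "the web is big"]
result = reduce_phase(combine_phase(map_phase(lines, word_mapper)), count_reducer)
# result == [("big", 1), ("grep", 1), ("is", 1), ("the", 2), ("web", 2)]
```

In a real cluster the map calls run on different nodes and the grouping happens across the network, but the division of responsibility between user code and framework is the same.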
Figure 5: Map Reduce Operation (in GrepTheWeb)
![[image]](http://mowser.com/img?url=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fservlet%2FKbServlet%2FdownloadImage%2F1632-102-188%2Ffigure5.png)
Hadoop suits the GrepTheWeb application well. Because each grep task can be run in parallel, independently of other grep tasks, the parallel approach embodied in Hadoop is a perfect fit.
For GrepTheWeb, the actual documents (the web) are crawled ahead of time and stored on Amazon S3. Each user starts a grep job by calling the StartGrep function at the service endpoint. When triggered, master and slave nodes (the Hadoop cluster) are started on Amazon EC2 instances. Hadoop splits the input (a document with pointers to Amazon S3 objects) into multiple manageable chunks of 100 lines each and assigns each chunk to a slave node to run the map task. The map task reads these lines and is responsible for fetching the files from Amazon S3, running the regular expression on them and writing the results locally. If there is no match, there is no output. The map tasks then pass the results to the reduce phase, which is an identity function (pass-through) that aggregates all the outputs. The “final” output is written back to Amazon S3.
Example
Regular Expression
“A(.*)zon”
Format of the line in the Input dataset
[URL] [Title] [charset] [size] [S3 Object Key of .gz file] [offset]
http://www.amazon.com/gp/browse.html?node=3435361 Amazon Web us-ascii 3509 /2008/01/08/51/1/51_1_20080108072442_crawl100.arc.gz 70150864
Mapper Implementation
Reducer Implementation - Pass-through (Built-in Identity Function) and write the results back to S3.
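As a rough illustration of the example above, a mapper and a pass-through reducer might look like the sketch below. It is not the production code: `parse_input_line`, `grep_mapper`, `identity_reducer`, the injected `fetch_document` callable (a stand-in for reading the document out of the Amazon S3 object), and the choice to emit (URL, title) pairs are all assumptions.

```python
import re

SAMPLE_LINE = (
    "http://www.amazon.com/gp/browse.html?node=3435361 Amazon Web us-ascii 3509 "
    "/2008/01/08/51/1/51_1_20080108072442_crawl100.arc.gz 70150864"
)


def parse_input_line(line):
    """Split one input line into its six fields; the title may contain spaces."""
    url, rest = line.split(" ", 1)
    title, charset, size, s3_key, offset = rest.rsplit(" ", 4)
    return {
        "url": url,
        "title": title,
        "charset": charset,
        "size": int(size),
        "s3_key": s3_key,
        "offset": int(offset),
    }


def grep_mapper(line, pattern, fetch_document):
    """Fetch the document a line points to and emit it only if the regex matches."""
    record = parse_input_line(line)
    text = fetch_document(record["s3_key"], record["offset"], record["size"])
    if pattern.search(text):
        return [(record["url"], record["title"])]
    return []  # no match, no output


def identity_reducer(key, values):
    """Pass every value through unchanged."""
    return [(key, value) for value in values]


pattern = re.compile(r"A(.*)zon")
# The lambda fakes the S3 read so the sketch runs without network access.
matches = grep_mapper(SAMPLE_LINE, pattern,
                      lambda key, offset, size: "Welcome to Amazon Web Services")
misses = grep_mapper(SAMPLE_LINE, pattern,
                     lambda key, offset, size: "nothing relevant here")
```

Splitting the URL from the left and the last four fields from the right keeps titles with embedded spaces, such as "Amazon Web", in one piece.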
Each of these points is discussed further in the context of GrepTheWeb.
The GrepTheWeb application uses highly-scalable components of the Amazon Web Services infrastructure that not only scale on-demand, but also are charged for on-demand.
All components of GrepTheWeb expose a service interface that defines their functions; these are called using HTTP requests and return XML responses. For programming convenience, small client libraries wrap and abstract the service-specific code.
Each component is independent from the others and scales in all dimensions. For example, if thousands of requests hit Amazon SimpleDB, it can handle the demand because it is designed to handle massive parallel requests.
Likewise, distributed processing frameworks like Hadoop are designed to scale. Hadoop automatically distributes jobs, resumes failed jobs, and runs on multiple nodes to process terabytes of data.
The GrepTheWeb team built a loosely coupled system using messaging queues. If a queue/buffer is used to "wire" any two components together, it can support concurrency, high availability and load spikes. As a result, the overall system continues to perform even if parts of components become unavailable. If one component dies or becomes temporarily unavailable, the system will buffer the messages and get them processed when the component comes back up.
Figure 6: Loose Coupling – Independent Phases
![[image]](http://mowser.com/img?url=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fservlet%2FKbServlet%2FdownloadImage%2F1632-102-189%2Ffigure6.png)
In GrepTheWeb, for example, if lots of requests suddenly reach the server (an Internet-induced overload situation) or the processing of regular expressions takes a longer time than the median (slow response rate of a component), the Amazon SQS queues buffer the requests durably so those delays do not affect other components.
In a multi-tenant system it is important to be able to get the status of each message or request, and GrepTheWeb supports this. It does so by storing and updating the status of each request in a separate query-able data store, implemented with Amazon SimpleDB. This combination of Amazon SQS for queuing and Amazon SimpleDB for state management helps achieve higher resilience through loose coupling.
In this “era of tera” and multi-core processors, when programming we ought to think in terms of multi-threaded processes.
In GrepTheWeb, wherever possible, the processes were made thread-safe through a share-nothing philosophy and were multi-threaded to improve performance. For example, objects are fetched from Amazon S3 by multiple concurrent threads, as such access is faster than fetching objects sequentially one at a time.
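The concurrent-fetch idea can be sketched with a thread pool. In this illustration `fetch_object` merely sleeps to simulate the latency of one Amazon S3 GET; the key names, the 50 ms delay, and the worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_object(key):
    """Stand-in for one S3 GET; sleeps to simulate network latency."""
    time.sleep(0.05)
    return key, f"contents of {key}"


def fetch_all(keys, max_workers=8):
    """Fetch every key concurrently and return a key -> body mapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each task returns its own result, so threads share no mutable state.
        return dict(pool.map(fetch_object, keys))


keys = [f"crawl/part-{i}.gz" for i in range(8)]
started = time.perf_counter()
objects = fetch_all(keys)
elapsed = time.perf_counter() - started  # compare with 8 * 0.05 s sequentially
```

Threads suit this workload because each fetch spends most of its time waiting on the network rather than using the CPU.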
If multi-threading is not sufficient, think multi-node. Until now, parallel computing across large clusters of machines was not only expensive but also difficult to achieve. First, it was difficult to get the funding to acquire a large cluster of machines, and once acquired, it was difficult to manage and maintain them. Second, after the cluster was acquired and managed, there were technical problems. It was difficult to run massively distributed tasks on the machines and to store and access large datasets. Parallelization was not easy and job scheduling was error-prone. Moreover, if nodes failed, detecting the failures was difficult and recovery was very expensive. Tracking jobs and status was often ignored because it quickly became complicated as the number of machines in the cluster increased.
But now, computing has changed. With the advent of Amazon EC2, provisioning a large number of compute instances is easy. A cluster of compute instances can be provisioned within minutes with just a few API calls and decommissioned as easily. With the arrival of distributed processing frameworks like Hadoop, there is no need for high-caliber parallel computing consultants to deploy a parallel application. Developers with no prior experience in parallel computing can implement a few interfaces in a few lines of code and parallelize the job without worrying about job scheduling, monitoring or aggregation.
In GrepTheWeb each building-block component is accessible via the Internet using web services, reliably hosted in Amazon’s datacenters and available on-demand. This means that the application can request more resources (servers, storage, databases, queues) or relinquish them whenever needed.
A beauty of GrepTheWeb is its almost-zero infrastructure before and after the execution. The entire infrastructure is instantiated in the cloud when triggered by a job request (grep) and is returned to the cloud when the job is done. Moreover, during execution it scales on demand; that is, the application scales elastically based on the number of messages, the size of the input dataset, the complexity of the regular expression and so forth.
For GrepTheWeb, there is reservation logic that decides how many Hadoop slave instances to launch based on the complexity of the regex and the input dataset. For example, if the regular expression does not have many predicates, or if the input dataset has just 500 documents, it will only spawn 2 instances. However, if the input dataset is 10 million documents, it will spawn up to 100 instances.
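Reservation logic of this kind could look roughly like the sketch below. The bounds of 2 and 100 instances come from the examples above; the `DOCS_PER_INSTANCE` constant, the way predicates scale the estimate, and the function name are invented for illustration and are not the actual formula.

```python
import math

MIN_INSTANCES = 2            # small-job example above
MAX_INSTANCES = 100          # 10-million-document example above
DOCS_PER_INSTANCE = 100_000  # assumed tuning constant, not from the article


def instances_to_launch(num_documents, num_predicates=1, override=None):
    """Guess how many slave instances a job needs, within fixed bounds."""
    if override is not None:
        # The instance count stays configurable for callers who want to set it.
        return override
    weighted_docs = num_documents * max(1, num_predicates)
    wanted = math.ceil(weighted_docs / DOCS_PER_INSTANCE)
    return max(MIN_INSTANCES, min(MAX_INSTANCES, wanted))


small_job = instances_to_launch(500)
large_job = instances_to_launch(10_000_000)
complex_job = instances_to_launch(10_000_000, num_predicates=3)
forced = instances_to_launch(10_000_000, override=48)
```

Clamping the estimate keeps tiny jobs from launching a single fragile node and keeps very large or complex jobs from requesting an unbounded fleet.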
Rule of thumb: Be a pessimist when using Cloud Architectures; assume things will fail. In other words, always design, implement and deploy for automated recovery from failure.
In particular, assume that your hardware will fail. Assume that outages will occur. Assume that some disaster will strike your application. Assume that you will be slammed with more requests per second some day. By being a pessimist, you end up thinking about recovery strategies at design time, which helps in designing a better overall system. For example, the following strategies can help in the event of adversity:
Good cloud architectures should be impervious to reboots and re-launches. In GrepTheWeb, by using a combination of Amazon SQS and Amazon SimpleDB, the overall controller architecture is more resilient. For instance, if the instance on which a controller thread was running dies, it can be brought back up and resume from the previous state as if nothing had happened. This was accomplished by creating a pre-configured Amazon Machine Image which, when launched, dequeues all the messages from the Amazon SQS queue and reads their states from the Amazon SimpleDB domain item on reboot.
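The recovery behavior can be sketched with a toy queue whose messages persist until they are explicitly deleted. This is a simplified illustration: `DurableQueue`, `state_store`, and `process_next` are hypothetical stand-ins for Amazon SQS, an Amazon SimpleDB item, and a controller loop, and real queue semantics such as visibility timeouts are not modeled.

```python
class DurableQueue:
    """Toy queue whose messages survive until explicitly deleted."""

    def __init__(self):
        self._messages = []

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        """Return the oldest message without removing it, or None if empty."""
        return self._messages[0] if self._messages else None

    def delete(self, body):
        self._messages.remove(body)


state_store = {}  # stands in for per-job status kept outside the instance


def process_next(q, crash_before_done=False):
    """Handle one job; the message is deleted only after the work succeeds."""
    job_id = q.receive()
    if job_id is None:
        return None
    state_store[job_id] = "in_progress"
    if crash_before_done:
        raise RuntimeError("instance died")  # simulated hardware failure
    state_store[job_id] = "completed"
    q.delete(job_id)
    return job_id


q = DurableQueue()
q.send("job-1")
try:
    process_next(q, crash_before_done=True)
except RuntimeError:
    pass  # a replacement instance boots from the same image
state_after_crash = state_store["job-1"]
resumed = process_next(q)
```

Because neither the pending message nor the job state lives on the failed instance, the replacement picks up exactly where the old one stopped.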
If a task tracker (slave) node dies due to hardware failure, Hadoop reschedules the task on another node automatically. This fault-tolerance enables Hadoop to run on large commodity server clusters overcoming hardware failures.
We ran several tests. An email address regular expression was run against 10 million documents. While 48 concurrent instances took 21 minutes to process the job, 92 concurrent instances took less than 6 minutes. This time includes the instance launch time and the start time of the Hadoop cluster. The total cost was around $5 for 48 instances and less than $10 for 92 instances.
Instead of building your applications on fixed and rigid infrastructures, Cloud Architectures provide a new way to build applications on on-demand infrastructures.
GrepTheWeb demonstrates how such applications can be built.
Without having any upfront investment, we were able to run a job massively distributed on multiple nodes in parallel and scale incrementally based on the demand (users, size of the input dataset). With no idle time, the application infrastructure was never underutilized.
In the next section, we will learn how each of the Amazon Infrastructure Services (Amazon EC2, Amazon S3, Amazon SimpleDB and Amazon SQS) was used, and we will share some of the lessons learned and some of the best practices.
Special Thanks to Kenji Matsuoka and Tinou Bao – the core team that developed the GrepTheWeb Architecture.