Homework 2

Given out 9/25/13, due 10/9/13.

In this assignment you will use a cluster assembled with cloud services to analyze a large volume of data. We will give you part of the code for a text search system with two phases: first building an index, and then querying it. For test data we will use the Common Crawl data set, which consists of about 6 billion web documents stored in Amazon's Simple Storage Service (S3).

Our code uses several Julia packages: TextAnalysis, Blocks, AWS, HTTPClient, and GZip, plus an interface (CommonCrawl) written specifically for the Common Crawl data set. Taken together this is a bit of a tour de force, but don't worry: you will not have to learn these libraries in detail. Rather, the goal is to understand how the pieces fit together and to learn a bit about web-based computing infrastructure.

1. Exploring the data


Go to the EC2 management console and create a "head node" for yourself from the AMI called HW2_18337. (Click on "AMIs" in the sidebar, then right click on HW2_18337 and select "Launch"). We recommend using instance type m1.medium for the head node. Name this instance with your name.

Connect to your instance from a Unix machine using

ssh -i juclass.pem ubuntu@ADDRESS

where ADDRESS is the "Public DNS" entry in the instance information. On a non-Unix machine, make sure your SSH client is configured to use juclass.pem as the private key.


From there you will be able to run the rest of the commands described here.

Please "stop" your head node when you're not using it; your data will be saved. Remember that picking "terminate" will delete your data, so be careful.

All the code is in the directory HW2 in the home directory (/home/ubuntu).

In Julia (just run julia), the following commands yield an object for interacting with the data:

using CommonCrawl
cc = CrawlCorpus("/mnt/cc", true)   # true enables debug output

A CrawlCorpus is constructed with a directory to use as cache space. Files will be downloaded automatically from S3 into this local cache as necessary, and from then on read directly from the cache. You are encouraged to glance at the code for CommonCrawl to see how this works. It is basically just fetching URLs using HTTPClient, but the details are fiddly.


Since this data set is so large, it is divided first into "segments", each of which has a numeric identifier like "1346823845675". Each segment contains many "archives", each of which has a URL like http://aws-publicdatasets.s3.amazonaws.com/common-crawl/parse-output/segment/1350433107106/1350520862923_5.arc.gz. The .gz extension indicates that the data is compressed. Each archive contains information about many web pages.

We can list segments and archives as follows:

segment_names = segments(cc)
archive_uris = archives(cc, "1346823845675")

You can also call archives(cc, n) to get the URIs of n archives drawn from all segments.

An archive provides a file-like interface:

arc = open(cc, archive_uris[1])

gives you a handle that can be used to read entries:

entry = read_entry(cc, arc)

A single entry has fields uri (the URI of the original web document), mime (the mime type of the data), and data (a byte array of raw data). If the data is ASCII or UTF-8 text, you can convert it to a string with bytestring(entry.data).
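For example, the following sketch (assuming the cc object and archive_uris from above) reads the first few entries of one archive and prints some basic information; the exact contents depend on which archive you pick, and non-text data is simply skipped:

arc = open(cc, archive_uris[1])
for i = 1:5
    entry = read_entry(cc, arc)
    println(entry.uri, "  (", entry.mime, ")")
    # only text entries can be meaningfully converted to a string
    if beginswith(entry.mime, "text/")
        text = bytestring(entry.data)
        println("    ", length(text), " characters of text")
    end
end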

2. Exploring the code


The overall strategy of our indexer is to have each node in the cluster process a subset of the archives we want to search. Each node stores the index files it generates to its local disk. To perform a search, each node loads its indexes and searches (a "map" process), then the results are combined (a "reduce" process) and presented to the user.
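The sketch below illustrates this map/reduce pattern with toy data on local worker processes; search_local is a hypothetical stand-in for searching a node's on-disk indexes, not a function from the assignment code:

addprocs(2)   # two local workers stand in for cluster nodes

# "map": every worker searches its own (here: fake, in-memory) index
@everywhere search_local(term) =
    term == "julia" ? ["http://example.com/page$(myid())"] : ASCIIString[]

partials = [remotecall_fetch(w, search_local, "julia") for w in workers()]

# "reduce": merge the per-worker lists of document names
result = union(partials...)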

Most of the code for this assignment was developed by prolific Julia contributor Tanmay Mohapatra. The code consists of a few short files, listed at the end of this section.


The index returned by archive_to_index internally maps words to lists of integer document IDs. However, these document IDs are not globally unique; they only make sense within a single index. Therefore we also include an array of document names (URLs) that can be used to map the IDs back to names. The names are globally unique. Be sure to return document names, not IDs, from search_part_idx.

The function get(index, term, default) is used to look up a list of document IDs given a term. The Julia type IntSet is useful for manipulating sets of integers.

search_part_idx should accept an optional third argument "and" or "or", specifying how multiple search terms should be combined.
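As a toy illustration of how get and IntSet can be combined for "and" versus "or" (made-up data, not the real index structure returned by archive_to_index):

index = ["julia" => IntSet(1, 2, 3), "mit" => IntSet(2, 4)]
terms = ["julia", "mit"]

sets = [get(index, t, IntSet()) for t in terms]
docs_and = reduce(intersect, sets)   # IDs of documents containing every term
docs_or  = reduce(union, sets)       # IDs of documents containing any term

The resulting IDs would then be mapped back to document names using the names array described above.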

If you want to try parts of the code interactively, run include("hw2.jl") at the Julia prompt; it will load the other needed files.

For convenience, you can also browse the code in these files:
ccconsts.jl
ccindexer.jl
ccsearcher.jl
ccutils.jl

Utility scripts:
startup.jl
shutdown.jl
runindexer.jl
runsearcher.jl

3. What to do


Read through all the code (it is less than 200 lines). Read the comments especially carefully.

Write the code to parallelize indexing in ccindexer.jl.

Implement the logic for searching for multiple terms in search_part_idx in ccsearcher.jl.

Implement parallel searching in search_index in ccsearcher.jl.

We recommend developing on one machine (your head node or local machine) and testing interactively. Calling create_index(n) should create an index of n archives, and search_index("some words", "and") will perform a search.
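For example, a single-machine test session might look like this (output omitted; the necessary archives are downloaded on the first run):

cd ~/HW2
julia

julia> include("hw2.jl")

julia> create_index(2)                        # index two archives

julia> search_index("parallel computing", "and")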

When you're done, move to the next section.

4. Scaling up


You can edit HW2/ccconsts.jl to set the instance type to use, the number of instances, and the number of worker processes per instance. All the other constants can be left alone.
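For illustration only, the relevant settings might look something like the lines below; the constant names here are hypothetical, so edit the names that actually appear in your copy of ccconsts.jl:

const INSTANCE_TYPE        = "m1.medium"   # EC2 instance type for the worker nodes
const NUM_INSTANCES        = 4             # number of instances to launch
const WORKERS_PER_INSTANCE = 1             # Julia worker processes per instance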

To start the cluster described by HW2/ccconsts.jl, run the shell command (in the HW2 directory)

julia ./startup.jl YOURNAME

substituting your name, which will be used to identify your cluster. This will take a couple of minutes, and you will see status messages while the script waits for your instances to come up and tries to connect to them.


When you're done with your cluster, run

julia ./shutdown.jl YOURNAME

To do a large indexing job, we provide a runindexer.jl script:

julia runindexer.jl YOURNAME NUM_ARCHIVES

This connects to the named cluster and runs your code in parallel.

Use the following command to search:

julia runsearcher.jl YOURNAME "search terms"

Submit your code along with some example searches. Also provide the number of processes you ran, the number of archives indexed, and the time taken to do so.

5. Credits


This homework would not have been possible without:

Tanmay Mohapatra - Blocks, CommonCrawl

Amit Murthy - AWS library, HTTPClient

John Myles White - TextAnalysis (and much, much more)

Kevin Squire - GZip

and many others.

6. Errata


1. Arguments to search_index and search_part_idx
search_index should also accept an op argument and pass it through to search_part_idx.

2. Making sure the needed temp directories exist
In batch runs the startup.jl script handles creating the necessary directories. If you are running on a single machine, you can use these commands to create the directories manually:

sudo mkdir -p /mnt/cc/part_idx /mnt/cc/docs
sudo chmod 777 /mnt/cc /mnt/cc/part_idx /mnt/cc/docs

3. Per-node index files
The call to Block in search_index may not do what you want. You don't have to use it; you may use an alternative mechanism like readdir to list files.
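For example, a readdir-based listing might look like this (the directory matches the errata above; adjust the path and add any file-name filtering to match what your indexer writes):

part_dir  = "/mnt/cc/part_idx"
idx_files = [joinpath(part_dir, f) for f in readdir(part_dir)]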

Running one process per node is acceptable (though in that case you should use single-core EC2 instances).