Homework 2
Given out 9/25/13, due 10/9/13.
In this assignment you will use a cluster assembled from cloud services to analyze a large volume of data. We will give you part of the code for a text search system with two phases: first building an index, then querying it. For test data we will use the Common Crawl data set, which consists of about 6 billion web documents stored in Amazon's Simple Storage Service (S3).
Our code uses several Julia packages: TextAnalysis, Blocks, AWS, HTTPClient, and GZip. There is also a package, CommonCrawl, providing an interface specifically for the Common Crawl data set. As such, this assignment is a bit of a tour de force, but don't worry: you will not have to learn these libraries in detail. Rather, the goal is to understand how the pieces fit together, and to learn a bit about web-based computing infrastructure.
1. Exploring the data
Go to the EC2 management console
and create a "head node" for yourself from the AMI called HW2_18337.
(Click on "AMIs" in the sidebar, then right click on HW2_18337 and select
"Launch"). We recommend using instance type m1.medium for the head node.
Name this instance with your name.
Connect to your instance from a Unix machine using
ssh -i juclass.pem ubuntu@ADDRESS
where ADDRESS is the "Public DNS" entry in the instance information. On a non-Unix machine, configure your SSH client to use juclass.pem as your private key.
From there you will be able to run the rest of the commands described here.
Please "stop" your head node when you're not using it; your data will be saved. Remember that picking "terminate" will delete your data, so be careful.
All the code is in the directory HW2 in the home directory (/home/ubuntu).
In julia (just run julia) the following commands yield an object for interacting with the data:
using CommonCrawl
cc = CrawlCorpus("/mnt/cc", true)  # "true" enables debug output
A CrawlCorpus is constructed with a directory to use as cache space. Files will be downloaded automatically from S3 into this local cache as necessary, and from then on read directly from the cache. You are encouraged to glance at the code for CommonCrawl to see how this works. It is basically just fetching URLs using HTTPClient, but the details are fiddly.
Since this data set is so large, it is divided first into "segments", each of which has
a numeric identifier like "1346823845675". Each segment contains many
"archives", each of which has a URL like http://aws-publicdatasets.s3.amazonaws.com/common-crawl/parse-output/segment/1350433107106/1350520862923_5.arc.gz. The .gz
extension indicates that the data is compressed. Each archive contains information
about many web pages.
We can list segments and archives as follows:
segment_names = segments(cc)
archive_uris = archives(cc, "1346823845675")
You can also call archives(cc, n) to get the addresses of n archives, over all segments.
An archive provides a file-like interface:
arc = open(cc, archive_uris[1])
gives you a handle that can be used to read entries:
entry = read_entry(cc, arc)
A single entry has fields uri (the URI of the original web document), mime (the mime type of the data), and data (a byte array of raw data). If the data is ASCII or UTF-8 text, you can convert it to a string with bytestring(entry.data).
2. Exploring the code
The overall strategy of our indexer is to have each node in the cluster process a
subset of the archives we want to search. Each node stores the index files it
generates to its local disk. To perform a search, each node loads its indexes
and searches (a "map" process), then the results are combined (a "reduce" process) and
presented to the user.
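The map/reduce strategy above can be sketched with Julia's pmap. This is a hedged, toy illustration, not the assignment's code: search_one_index is a hypothetical stand-in that just returns some matching document names, and in recent Julia pmap lives in the Distributed standard library (in the Julia of this assignment it is built in).

```julia
using Distributed   # provides pmap in recent Julia; built in circa 2013

# Stand-in for a per-node search over one index file (hypothetical).
search_one_index(path) = path == "part1.idx" ? ["doc_a", "doc_b"] : ["doc_b", "doc_c"]

index_files = ["part1.idx", "part2.idx"]
hits = pmap(search_one_index, index_files)   # "map": one search per index
result = reduce(union, hits)                 # "reduce": combine the answers
# result == ["doc_a", "doc_b", "doc_c"]
```

With real workers added (addprocs), each pmap call runs on a different process; without workers it falls back to running serially, which is convenient for testing.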
Most of the code for this assignment was developed by prolific Julia contributor Tanmay Mohapatra. The code consists of a few short files:
- ccconsts.jl imports the needed libraries and defines the file paths to use
locally.
- ccutils.jl defines as_serialized, which can save an arbitrary
Julia object to disk, and as_deserialized, which reverses the process.
There is also preprocess, which is serial code to invoke many
text cleaning passes on the data to make it easier to analyze.
- ccindexer.jl generates indexes of crawl data. archive_to_index
accepts a single archive URI and returns an "inverted index", which maps words to
lists of documents containing them. The code is serial, and runs on a single node.
You will add parallelism in create_index, which creates indexes for many
archives. As the indexes are generated, they should be written to local disk.
- ccsearcher.jl implements searching. search_part_idx searches for query terms in a single part of the index. search_index does this in parallel and combines results. Each node should operate on its local index data.
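The save/load round trip that as_serialized and as_deserialized perform (described above) can be sketched with Julia's serializer directly. This is a minimal illustration, not the ccutils.jl code; in recent Julia, serialize and deserialize live in the Serialization standard library, while in the Julia of this assignment they are built in.

```julia
using Serialization   # serialize/deserialize; built into Base circa 2013

idx = Dict("cat" => [1, 3], "dog" => [2])
path = joinpath(mktempdir(), "part.idx")

open(io -> serialize(io, idx), path, "w")   # save an arbitrary Julia object
loaded = open(deserialize, path)            # ...and read it back
# loaded == idx
```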
The index returned by archive_to_index internally maps words to lists of integer document IDs. However, these document IDs are not globally unique and only make sense within a single index. Therefore we also include an array of document names (URLs) that can be used to map indexes back to names. The names are globally unique. Be sure to return document names from search_part_idx.
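The index shape described above can be illustrated with a toy example (this is not the actual archive_to_index output; the documents and URLs are made up): an inverted index mapping words to local document IDs, plus a names array for translating IDs back to globally unique URLs.

```julia
# Three tiny "documents" and their globally unique names (made-up URLs).
doc_names = ["http://a.example/", "http://b.example/", "http://c.example/"]
docs = ["the cat sat", "the dog sat", "the cat ran"]

index = Dict{String,Vector{Int}}()
for (id, text) in enumerate(docs)
    for word in split(text)
        push!(get!(index, word, Int[]), id)   # record the local doc ID
    end
end

index["cat"]              # → [1, 3]  (local IDs, meaningful only in this index)
doc_names[index["cat"]]   # → ["http://a.example/", "http://c.example/"]
```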
The function get(index, term, default) is used to look up a list of document IDs given a term. The Julia type IntSet is useful for manipulating sets of integers.
search_part_idx should accept an optional third argument "and" or "or", specifying how multiple search terms should be combined.
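The get-then-combine logic can be sketched as follows. This is a hedged illustration, not the ccsearcher.jl implementation: "index" is a toy postings table, "combine" is a hypothetical helper, and it uses Set where the assignment suggests IntSet (renamed BitSet in later Julia versions).

```julia
# Toy postings table: term => set of local document IDs.
index = Dict("cat" => Set([1, 3]), "dog" => Set([2, 3]))

# Combine postings for several terms with "and" (all terms) or "or" (any term).
function combine(index, terms, op)
    postings = [get(index, t, Set{Int}()) for t in terms]  # empty set for unknown terms
    op == "and" ? reduce(intersect, postings) : reduce(union, postings)
end

combine(index, ["cat", "dog"], "and")   # → Set([3])
combine(index, ["cat", "dog"], "or")    # → Set([1, 2, 3])
```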
If you want to try parts of the code, include hw2.jl interactively, which will load the other needed files.
For convenience, you can also browse the code here:
ccconsts.jl
ccindexer.jl
ccsearcher.jl
ccutils.jl
Utility scripts:
startup.jl
shutdown.jl
runindexer.jl
runsearcher.jl
3. What to do
Read through all the code (it is less than 200 lines). Read the comments especially
carefully.
Write the code to parallelize indexing in ccindexer.jl.
Implement the logic for searching for multiple terms in search_part_idx in ccsearcher.jl.
Implement parallel searching in search_index in ccsearcher.jl.
We recommend developing on one machine (your head node or local machine) and testing interactively. Calling create_index(n) should create an index of n archives, and search_index("some words", "and") will perform a search.
When you're done, move to the next section.
4. Scaling up
You can edit HW2/ccconsts.jl to set the instance type to use, and the number
of instances and number of workers per instance. All the other constants can be left
alone.
To start the cluster described by HW2/ccconsts.jl, run the shell command (in the HW2 directory)
julia ./startup.jl YOURNAME
substituting your name, which will be used to identify your cluster. This will take a couple of minutes, and you will see messages as the script repeatedly tries to connect to your instances.
When you're done with your cluster, run
julia ./shutdown.jl YOURNAME
To do a large indexing job, we provide a runindexer.jl script:
julia runindexer.jl YOURNAME NUM_ARCHIVES
This connects to the named cluster and runs your code in parallel.
Use the following command to search:
julia runsearcher.jl YOURNAME "search terms"
Submit your code along with some example searches. Also report the number of processes you used, the number of archives indexed, and the time taken to do so.
5. Credits
This homework would not have been possible without:
Tanmay Mohapatra - Blocks, CommonCrawl
Amit Murthy - AWS library, HTTPClient
John Myles White - TextAnalysis (and much, much more)
Kevin Squire - GZip
and many others.
6. Errata
1. Arguments to search_index and search_part_idx
search_index should also accept an op argument and pass it
through to search_part_idx.
2. Making sure the needed temp directories exist
In batch runs the startup.jl script handles creating the necessary
directories. If you are running on a single machine, you can use these
commands to create the directories manually:
sudo mkdir -p /mnt/cc/part_idx /mnt/cc/docs
sudo chmod 777 /mnt/cc /mnt/cc/part_idx /mnt/cc/docs
3. Per-node index files
The call to Block in search_index may not do what you want.
You don't have to use it; you may use an alternative mechanism like readdir
to list files.
Running one process per node is acceptable (though in that case you should use single-core EC2 instances).