# 6.338 Midterm Project
Harriet Li (kameeko@)

In [None]:
Pkg.add("TextAnalysis")
Pkg.add("PyPlot")
Pkg.update()
using TextAnalysis
using PyPlot

The original task was to create "some kind of inventory of the word 'parallel' in the google groups" and "synthesize julia's parallel computing approach". Since there is too much material to go through by hand, I make a first pass by compiling posts to the julia-users Google Group that contain the word "parallel" and issues in the Julia GitHub repository tagged "parallel". The package TextAnalysis.jl is used to compute the frequency of keywords and topics among the posts.

Since I was not able to find an easy way to automatically download Google Group posts or GitHub issues (the second appears more feasible but I encountered installation issues), I manually copied and saved the 224 most relevant Google Group posts (as of 10/14/16) and newest 100 closed GitHub issues. 

First, read in the text files and create a Corpus out of them.

In [None]:
posts = Any[] #can't specify FileDocument or else Corpus complains
for i = 1:9
    filedoc = FileDocument(string("posts_cleaned/00","$(i)",".txt"))
    push!(posts,StringDocument(filedoc))
end
for ii = 10:99
    filedoc = FileDocument(string("posts_cleaned/0","$(ii)",".txt"))
    push!(posts,StringDocument(filedoc))
end
for iii = 100:224
    filedoc = FileDocument(string("posts_cleaned/$(iii)",".txt"))
    push!(posts,StringDocument(filedoc))
end

#add github issues
for i = 1:9
    filedoc = FileDocument(string("github_closed_cleaned/00","$(i)",".txt"))
    push!(posts,StringDocument(filedoc))
end
for ii = 10:99
    filedoc = FileDocument(string("github_closed_cleaned/0","$(ii)",".txt"))
    push!(posts,StringDocument(filedoc))
end
for iii = 100:100
    filedoc = FileDocument(string("github_closed_cleaned/$(iii)",".txt"))
    push!(posts,StringDocument(filedoc))
end
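
The three loops per directory differ only in how many zeros are prepended to the file number. As a minimal sketch (assuming Julia 1.x; `numbered_paths` is a hypothetical helper, not part of TextAnalysis), the same file names can be generated with `lpad`:

```julia
# Hypothetical helper: build "dir/001.txt", "dir/002.txt", ... "dir/<n>.txt".
# Directory names and counts mirror the loops above.
numbered_paths(dir, n) = [string(dir, "/", lpad(string(i), 3, '0'), ".txt") for i in 1:n]

post_paths  = numbered_paths("posts_cleaned", 224)
issue_paths = numbered_paths("github_closed_cleaned", 100)
all_paths   = vcat(post_paths, issue_paths)

println(first(all_paths))
println(length(all_paths))
```

Each path could then be passed to `FileDocument` exactly as in the loops above.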

In [None]:
crps = Corpus(posts)

The TextAnalysis package has preprocessing tools to help one better analyze the documents in terms of the frequency of n-grams. Stripping stop-words removes common words like "all" and "almost" that carry no information about the topics of interest. Since this also removes the names of some programming languages, like "C", "R", and "Go", these words are modified beforehand using perl to avoid being stripped (ex: "C" is replaced with "Clanguage"). One can also remove specific n-grams, such as "julia-users", which appears in all posts to the Google Group.

In [None]:
#preprocessing
prepare!(crps, strip_case | strip_whitespace)
using Languages
prepare!(crps, strip_articles)
prepare!(crps, strip_definite_articles)
prepare!(crps, strip_indefinite_articles)
prepare!(crps, strip_prepositions)
prepare!(crps, strip_pronouns)
prepare!(crps, strip_punctuation) #.jl replaced with -jl in preprocessing
prepare!(crps, strip_stopwords) #kills "Go" and "C" unless preprocessed before reading in
prepare!(crps, strip_numbers) #confuses "Seed7" with other uses of "seed"
remove_words!(crps, ["julia-users"])
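
The renaming step that protects short language names was done with perl before the files were read in. A rough Julia equivalent might look like the sketch below (assuming Julia 1.x). Only the "C" to "Clanguage" and ".jl" to "-jl" rewrites are stated in this notebook; the other rules are inferred from the tokens searched for later, and the exact regular expressions are assumptions.

```julia
# Hypothetical rewrite rules, applied in order. "C++" must be handled
# before the bare "C" rule so that it is not split apart.
const PROTECT_RULES = [
    r"\bC\+\+"   => "cpp",
    r"\bC\b"     => "Clanguage",
    r"\bR\b"     => "rlang",
    r"\bGo\b"    => "golang",
    r"\bSeed7\b" => "seedseven",
    r"\bHDF5\b"  => "hdffive",
    r"\.jl\b"    => "-jl",
]

# Apply every rule to the raw text of one post.
function protect_tokens(text::AbstractString)
    for rule in PROTECT_RULES
        text = replace(text, rule)
    end
    return text
end

println(protect_tokens("Calling C and C++ from MPI.jl in Go, R, Seed7 and HDF5"))
```

Because the rules are case-sensitive, they have to run before `strip_case` lowercases the text.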

Creating a lexicon allows one to see the number of documents in which an n-gram appears.

In [None]:
#which posts contained a certain word
lexicon(crps)
update_lexicon!(crps)
lexicon(crps)

One can query the lexical frequency of an n-gram, which is the number of times a particular n-gram appears out of all appearances of all n-grams. This is not particularly useful in this case because many of the n-grams consist of nonsense, like misspellings, memory addresses in error messages, and variable names in snippets of code.

In [None]:
lexical_frequency(crps, "cpp") #compared to all unigrams
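
To make the definition concrete, here is a toy calculation on a made-up lexicon (assuming Julia 1.x; `toy_lexical_frequency` and the counts are illustrative and not taken from the corpus):

```julia
# Toy token counts standing in for a corpus lexicon (made-up numbers).
toy_lexicon = Dict("parallel" => 6, "cpp" => 3, "0x7f3a" => 1)

# Share of all token occurrences accounted for by `word`.
function toy_lexical_frequency(lex::Dict{String,Int}, word::String)
    total = sum(values(lex))
    return get(lex, word, 0) / total
end

println(toy_lexical_frequency(toy_lexicon, "cpp"))  # 3 of the 10 occurrences
```

Every nonsense token adds to the denominator, which is why the raw value says little about how widely a term is discussed.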

One can also sort the n-grams by frequency, although, again, this is not particularly useful in this case due to the large set of nonsense n-grams.

In [None]:
#sort unigrams by frequency
uni_by_freq = sort(collect(crps.lexicon), by = tuple -> last(tuple), rev=true) 

Creating an inverse index allows one to see which documents an n-gram appears in. This can serve as a glorified word-finding tool, but is used here as a normalization method. The frequency of an n-gram is described by the number of unique documents it appears in, to minimize the impact of error messages and other copy-and-pasted segments of text.

In [None]:
#see which words appeared in which posts
inverse_index(crps)
update_inverse_index!(crps)
inverse_index(crps)

In [None]:
crps["mpi-jl"]

In [None]:
crps["embarrassingly"] #this probably is part of "embarrassingly parallel"

In [None]:
crps["seedseven"]
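
The difference between counting occurrences and counting documents can be seen on a toy example. The sketch below (assuming Julia 1.x) builds a small inverse index by hand; `build_inverse_index` and the documents are made up for illustration and do not reproduce TextAnalysis internals.

```julia
# Toy documents, already tokenized (made-up content).
toy_docs = [
    ["error", "error", "error", "mpi"],   # one post with a pasted error log
    ["mpi", "cluster"],
    ["cluster", "mpi"],
]

# Map each token to the list of document indices that contain it.
function build_inverse_index(docs)
    index = Dict{String,Vector{Int}}()
    for (i, tokens) in enumerate(docs)
        for tok in unique(tokens)
            push!(get!(index, tok, Int[]), i)
        end
    end
    return index
end

toy_index = build_inverse_index(toy_docs)
raw_count(tok) = sum(count(==(tok), d) for d in toy_docs)

println(raw_count("error"), " occurrences in ", length(toy_index["error"]), " document(s)")
println(raw_count("mpi"), " occurrences in ", length(toy_index["mpi"]), " document(s)")
```

Both tokens occur three times, but "error" is confined to a single post, so counting documents ranks "mpi" higher.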

In [None]:
#functions for plotting
function sort_for_plot(labels::Vector{String},numdocs::Vector{Int64})
    n = length(labels);
    plotme = Dict{String,Int64}(labels[ii] => numdocs[ii] for ii = 1:n);
    plotme_ordered = sort(collect(plotme), by = tuple -> last(tuple), rev=true);
    labels = [plotme_ordered[ii][1] for ii = 1:n]
    numdocs = [plotme_ordered[ii][2] for ii = 1:n]
    
    return labels, numdocs
end

function shiny_pie(values::Vector{Int64},
    labels::Vector{String},
    titlestring::String,
    startangle::Float64)
    
    my_cmap = ColorMap("Set2")
    
    fig = figure(titlestring,figsize=(8,8))
    p = pie(values,
            shadow=true,
            startangle=startangle,
            colors=my_cmap(linspace(0,1,length(values))),
            autopct="%1.1f%%",
            pctdistance=0.7
    )
    axis("equal")
    title(titlestring)
    
    foo = 0.
    for (p1, l1) in zip(p[1], labels)

        r = p1[:r]
        dr = r*0.1
        t1, t2 = p1[:theta1], p1[:theta2]
        theta = (t1+t2)/2.

        xc = cos(theta/180.*pi)*r
        yc = sin(theta/180.*pi)*r
        x1 = cos(theta/180.*pi)*(r+dr)
        y1 = sin(theta/180.*pi)*(r+dr)

        if x1 > 0.
            x1 = r+2*dr
            ha, va = "left", "center"
            cstyle="angle,angleA=180,angleB=$(-theta)"
        else
            x1 = -(r+2*dr)
            ha, va = "right", "center"
            cstyle="angle,angleA=0,angleB=$(theta)"
        end
        if foo > 0.
            if theta - foo < 10.
                y1 = y1 + 0.1
                if x1 > 0. 
                    cstyle="arc,angleA=180,armA=30,armB=10,angleB=$(-theta)"
                else
                    cstyle="arc,angleA=0,armA=30,armB=10,angleB=$(theta)"
                end
            end
        end

        foo = theta

        annotate(l1,
                (xc, yc), xycoords="data",
                xytext=(x1, y1), textcoords="data", ha=ha, va=va,
                arrowprops=Dict("arrowstyle"=>"-",
                                "connectionstyle"=>cstyle,
                                "patchB"=>p1))

    end
    
    return true
end


First, let's see what kinds of platforms and operating systems are being used to perform parallel computing.

In [None]:
#platforms
platwords = [Set{String}(["laptop","laptops"]),
    Set{String}(["desktop","desktops"]),
    Set{String}(["workstation","workstations"]),
    Set{String}(["cluster","clusters"]),
    Set{String}(["cloud"]),
    Set{String}(["server","servers"]),
    Set{String}(["juliabox"]),
    Set{String}(["network","networks"]),
    Set{String}(["remote"])]; #words to look for in lexicon
docs_plat = [Set{Int64}() for ii = 1:9]; #indices of documents in which any word of the group occurs
for ii = 1:9
    for unig in platwords[ii]
        for mention in crps[unig]
            push!(docs_plat[ii],mention)
        end
    end
end
numdocs_plat = [length(docs_plat[ii]) for ii = 1:9]
labels_plat = ["Laptop","Desktop","Workstation","Cluster","Cloud","Server","JuliaBox","Network","Remote"]; #labels for plotting

#sort by frequency
labels_plat, numdocs_plat = sort_for_plot(labels_plat, numdocs_plat)
        
#plot
shiny_pie(numdocs_plat,labels_plat,"Computing Platforms Mentioned",0.)

In [None]:
#operating systems
oswords = [Set{String}(["linux","ubuntu","redhat","xubuntu","debian","unix","fedora","centos"]),
    Set{String}(["windows","pc"]),
    Set{String}(["ios","mac","osx","macosx","capitan","yosemite","maverick","iphone","ipad","macbook"]),
    Set{String}(["android"]),
    Set{String}(["chrome"])]; #words to look for in lexicon
docs_os = [Set{Int64}() for ii = 1:5]; #indices of documents in which any word of the group occurs
for ii = 1:5
    for unig in oswords[ii]
        for mention in crps[unig]
            push!(docs_os[ii],mention)
        end
    end
end
numdocs_os = [length(docs_os[ii]) for ii = 1:5]
labels_os = ["Linux","Windows","macOS/iOS","Android","Chrome OS"]; #labels for plotting

#sort by frequency
labels_os, numdocs_os = sort_for_plot(labels_os, numdocs_os)
        
#plot
shiny_pie(numdocs_os,labels_os,"Operating Systems Mentioned",20.0)

Now let's see what other programming languages are mentioned. From a skim of the posts, these are usually mentioned as part of a comparison (ex: "Can we get an x-like feature in Julia?") but are also sometimes mentioned as something to interface with Julia (ex: "How do I call x from Julia?"). R is still particularly vulnerable to false alarms because it also appears as a variable name and in command-line outputs.

In [None]:
#non-Julia programming languages (usually compared to Julia, sometimes to interface with Julia)

langwords = ["clanguage", "cpp", "python", "java", "octave", "matlab", "fortran",
    "rlang","perl","scala","javascript","php","ruby","elixir","erlang","golang","atlas",
    "seedseven","pascal","chapel","haskell"]; #words to look for in lexicon
numdocs_lang = [length(crps[unig]) for unig in langwords]; #number of documents in which word occurs
docs_lang = Set{Int64}();
for unig in langwords
    for mention in crps[unig]
        push!(docs_lang,mention)
    end
end
labels_lang = ["C", "C++", "Python", "Java", "Octave", "Matlab", "Fortran", 
    "R","Perl","Scala","Javascript","PHP","Ruby","Elixir","Erlang","Go","Atlas",
    "Seed7","Pascal","Chapel","Haskell"]; #labels for plotting

#sort by frequency
labels_lang, numdocs_lang = sort_for_plot(labels_lang, numdocs_lang)
        
#plot
shiny_pie(numdocs_lang,labels_lang,"Non-Julia Programming Languages Mentioned",20.)
    
#total percentage of posts that mentioned any non-julia programming language
println("Percentage of posts mentioning any non-Julia language: $(100*length(docs_lang)/length(crps.documents))%")

Now let's see what non-Julia libraries/frameworks are mentioned. The words searched for by no means comprise an exhaustive list; the list of what libraries to look for was generated by manually skimming the julia-users posts.

In [None]:
#non-Julia libraries/frameworks
libwords = ["openmp","mpi","mpich","blas","openblas","lapack","hdffive","hdfs",
    "opencl","cuda","hadoop","spark","flask","spray","mkl","mysql"]; #words to look for in lexicon
    #not mentioned in current sample: openmpi, pig
numdocs_lib = [length(crps[unig]) for unig in libwords];
docs_lib = Set{Int64}();
for unig in libwords
    for mention in crps[unig]
        push!(docs_lib,mention)
    end
end
labels_lib = ["OpenMP","MPI","MPICH","BLAS","OpenBLAS","LAPACK","HDF5","HDFS",
    "OpenCL","CUDA","Hadoop","Spark","Flask","Spray","MKL","MySQL"]; #labels for plotting

#sort by frequency
labels_lib, numdocs_lib = sort_for_plot(labels_lib, numdocs_lib)
        
#plot
shiny_pie(numdocs_lib,labels_lib,"Non-Julia Libraries/Frameworks Mentioned",10.)
    
#total percentage of posts that mentioned any non-julia library/framework
println("Percentage of posts mentioning any non-Julia library/framework: $(100*length(docs_lib)/length(crps.documents))%")

Now let's see which Julia packages are popular in parallel programming applications. This is done by finding all unigrams that end with the Julia package extension. This is of course not a foolproof method, as users sometimes refer to their own scripts by the .jl extension, and packages are sometimes mentioned without the .jl extension (ex: DistributedArrays, Distributed Arrays, DistributedArrays.jl). Some packages also appear to be mentioned mostly in Julia warnings or error outputs (ex: multi.jl), which does not indicate any direct user interest in them. Searching for these packages without the .jl extension as well can lead to many false alarms, as some have names that are also common words, especially when case is ignored (ex: this would work for DistributedArrays.jl, but not for plot.jl or JuMP).

In [None]:
#Julia packages/wrappers

all_unigrams = keys(crps.lexicon)
julia_pkgs = [key for key in all_unigrams if endswith(key,"-jl") && length(key)>3]
ndocs_jpkg = [length(crps[unig]) for unig in julia_pkgs]

labels_jpkg = [replace(jpkg,"-",".") for jpkg in julia_pkgs] #labels for plotting

#sort by frequency
labels_jpkg, ndocs_jpkg = sort_for_plot(labels_jpkg, ndocs_jpkg)

common_labels_jpkg = [labels_jpkg[ii] for ii = 1:length(ndocs_jpkg) if ndocs_jpkg[ii]>2];
common_ndocs_jpkg = [ndocs_jpkg[ii] for ii = 1:length(ndocs_jpkg) if ndocs_jpkg[ii]>2];

fig = figure("most_common_julia_package_mentions",figsize=(8,8))
b = barh(-1:-1:-length(common_labels_jpkg),100.*common_ndocs_jpkg/length(crps.documents),align="center",alpha=0.4)
yticks(-1:-1:-length(common_labels_jpkg),common_labels_jpkg)
title("Most Commonly Mentioned Julia Packages \n (mentioned as .jl)")
xlabel("Percentage of Documents")
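
The suffix filter can be tried out on a made-up lexicon. This is a minimal sketch assuming Julia 1.x; the tokens and counts are invented, and `chop` is used so that only the trailing "-jl" is turned back into ".jl", leaving any other hyphen in a name untouched.

```julia
# Toy lexicon (made-up counts); "-jl" alone is what a stray ".jl" becomes.
toy_lexicon = Dict("mpi-jl" => 12, "distributedarrays-jl" => 7, "-jl" => 2,
                   "multi-jl" => 4, "parallel" => 90)

# Keep tokens that end in "-jl" and have something in front of the suffix.
pkg_tokens = sort([k for k in keys(toy_lexicon) if endswith(k, "-jl") && length(k) > 3])

# Restore the ".jl" extension for display by replacing only the last 3 characters.
pkg_labels = [string(chop(k, tail=3), ".jl") for k in pkg_tokens]

println(pkg_labels)
```

A token such as "multi-jl" still passes the filter, which is the false-alarm case described above.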

One can search more precisely by including bigrams, which make it possible to find two-word phrases for which neither word alone is informative (ex: "embarrassingly" almost always shows up as part of "embarrassingly parallel", but "distributed array" is a specific object that does not make up the bulk of mentions of either "distributed" or "array").

In [None]:
biposts = Any[] #can't specify FileDocument or else Corpus complains
for i = 1:9
    filedoc = FileDocument(string("00","$(i)",".txt"))
    push!(biposts,NGramDocument(ngrams(filedoc,2)))
end
for ii = 10:99
    filedoc = FileDocument(string("0","$(ii)",".txt"))
    push!(biposts,NGramDocument(ngrams(filedoc,2)))
end
for iii = 100:224
    filedoc = FileDocument(string("$(iii)",".txt"))
    push!(biposts,NGramDocument(ngrams(filedoc,2)))
end

crps2 = Corpus(biposts)

prepare!(crps2, strip_case | strip_whitespace)
prepare!(crps2, strip_articles)
prepare!(crps2, strip_definite_articles)
prepare!(crps2, strip_indefinite_articles)
prepare!(crps2, strip_prepositions)
prepare!(crps2, strip_pronouns)
prepare!(crps2, strip_punctuation) 

lexicon(crps2)
update_lexicon!(crps2)
lexicon(crps2)

inverse_index(crps2)
update_inverse_index!(crps2)
inverse_index(crps2)

In [None]:
crps2["distributed array"]
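
For readers unfamiliar with bigrams, the sketch below shows what the lookup key "distributed array" corresponds to. It is a hand-rolled illustration assuming Julia 1.x; `toy_bigrams` is hypothetical and independent of how TextAnalysis builds its n-grams.

```julia
# Adjacent-pair bigrams from a token list, joined with a single space.
toy_bigrams(tokens) = [string(tokens[i], " ", tokens[i+1]) for i in 1:length(tokens)-1]

tokens  = split("a distributed array lives on several workers")
bigrams = toy_bigrams(tokens)

println(bigrams)
println("distributed array" in bigrams)
```

A sentence with n tokens yields n-1 bigrams, so the bigram lexicon grows much faster than the unigram one.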

Let's use the unigram and bigram lexicons to see how often terms related to different aspects of parallelism come up. The TextAnalysis package has a "stemming" method that combines similar words, like "dog" and "dogs", into one word and reduces the size of the lexicon. This was not used for this demonstration.

In [None]:
#aspects of parallelism 
parwords = [Set{String}(["garbage","clean up"]),
    Set{String}(["gpu"]),
    Set{String}(["scale","scales","scalability","scaling","scaled"]),
    Set{String}(["thread","threads","multi-threading"]),
    Set{String}(["latency"]),
    Set{String}(["bandwidth"]),
    Set{String}(["sync","synchronize","synchronous","asynchronous"]),
    Set{String}(["data transfer"]),
    Set{String}(["distributed memory"]),
    Set{String}(["shared memory"])]
parlabels = ["Garbage Cleanup","GPU","Scaling","Threading","Latency","Bandwidth","Synchronization",
    "Data Transfer","Distributed Memory","Shared Memory"]

docs_par = [Set{Int64}() for ii = 1:10]; #indices of documents in which any term of the group occurs
for ii = 1:10
    for ng in parwords[ii]
        if contains(ng," ")
            for mention in crps2[ng]
                push!(docs_par[ii],mention)
            end
        else
            for mention in crps[ng]
                push!(docs_par[ii],mention)
            end
        end   
    end
end
numdocs_par = [length(docs_par[ii]) for ii = 1:10]

parlabels, numdocs_par = sort_for_plot(parlabels, numdocs_par)

fig = figure("weee",figsize=(8,8))
b = barh(-1:-1:-length(parlabels),100.*numdocs_par/length(crps.documents),align="center",alpha=0.4)
yticks(-1:-1:-length(parlabels),parlabels)
title("Key Terms Related to Parallel Programming")
xlabel("Percentage of Documents")
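
The lookup pattern used in this cell (phrases containing a space go to the bigram corpus, everything else to the unigram corpus, and each document is counted once per group) could be factored into a helper. The sketch below assumes Julia 1.x and uses plain dictionaries as hypothetical stand-ins for `crps` and `crps2`; `docs_mentioning` is not a TextAnalysis function.

```julia
# Hypothetical stand-ins for the two corpora: term => indices of documents containing it.
toy_unigram_index = Dict("thread" => [1, 4], "threads" => [4, 7], "gpu" => [2])
toy_bigram_index  = Dict("shared memory" => [3, 4], "data transfer" => [5])

# Collect the distinct documents that mention any term in `terms`.
# Terms containing a space are looked up in the bigram index.
function docs_mentioning(terms, unigram_index, bigram_index)
    docs = Set{Int}()
    for term in terms
        index = occursin(" ", term) ? bigram_index : unigram_index
        union!(docs, get(index, term, Int[]))
    end
    return docs
end

threading = docs_mentioning(["thread", "threads", "multi-threading"],
                            toy_unigram_index, toy_bigram_index)
println(length(threading))  # documents 1, 4 and 7
```

Using a set means document 4, which contains both "thread" and "threads", contributes only once to the group total.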

Let's compare how often users mention objects, tasks, and libraries that are specific to either shared or distributed memory parallel programming models.

In [None]:
#shared vs distributed memory
dvswords = [Set{String}(["distributed memory","mpi","mpi-jl",
        "distributedarrays-jl","distributedarray","distributedarrays","distributed arrays"]),
    Set{String}(["shared memory","openmp","thread","threads",
        "multi-threading","multi-threaded","sharedarray","sharedarrays","sharedarray-jl","sharedvector"])]
dvslabels = ["Distributed Memory","Shared Memory"]

docs_dvs = [Set{Int64}() for ii = 1:2]; #indices of documents in which any term of the group occurs
for ii = 1:2
    for ng in dvswords[ii]
        if contains(ng," ")
            for mention in crps2[ng]
                push!(docs_dvs[ii],mention)
            end
        else
            for mention in crps[ng]
                push!(docs_dvs[ii],mention)
            end
        end   
    end
end
numdocs_dvs = [length(docs_dvs[ii]) for ii = 1:2]

dvslabels, numdocs_dvs = sort_for_plot(dvslabels, numdocs_dvs)

fig = figure("argh",figsize=(8,3))
b = barh(-1:-1:-length(dvslabels),100.*numdocs_dvs/length(crps.documents),align="center",alpha=0.4)
yticks(-1:-1:-length(dvslabels),dvslabels)
title("Mentions of Terms/Phrases/Packages Specific to \n Distributed vs Shared Memory")
xlabel("Percentage of Documents")

Let's see how often posts mention integration with a non-Julia language.

In [None]:
#interfacing with languages (not specific packages)
#can't suss out interfaces with packages vs languages, and if included interfaces to packages, then there
#are many packages that are wrappers and aren't described as such...
docs_xface = Set{Int64}();
xface_unig = ["pycall","javacall","rcall","interface","wrapper","wrappers"]; #integration-related unigrams
xface_bigr = [[string(lang," integration") for lang in langwords]; 
    [string(lang," integration") for lang in libwords]; 
    "julia integration"; "integration with"]; #integration-related bigrams; 
    #"integration with" used to avoid numerical integration, but killed by stopword removal
for unig in xface_unig
    for mention in crps[unig]
        push!(docs_xface,mention)
    end
end
for bigr in xface_bigr
    for mention in crps2[bigr]
        push!(docs_xface,mention)
    end
end
println("Percentage of posts mentioning integration with non-Julia language/package: $(100.*length(docs_xface)/length(crps.documents))%")

Let's get a general idea of what kinds of problems users are interested in solving.

In [None]:
#problems solved using parallel programming
probwords = [Set{String}(["pde","pdes","ode","odes","dde","ddes","dae","daes","differential",
        "sde","sdes","fem","finite element"]),
    Set{String}(["monte carlo","mc","mcmc"]),
    Set{String}(["io"]),
    Set{String}(["optimization","optimisation"]),
    Set{String}(["graph","graphs"]),
    Set{String}(["machine learning"]),
    Set{String}(["database","databases","sql"]),
    Set{String}(["simulate","simulation"]),
    Set{String}(["linear","matrix","matrices","eigenvalue","eigenvalues","eigenvector","eigenvectors","trace",
        "svd","chol","cholesky","condition number","determinant","lu","norm","rank","nullspace",
        "schur","sparse","dense"])]
problabels = ["Differential Equations","Monte Carlo","I/O","Optimization","Graphs",
    "Machine Learning","Databases","Simulations","Linear Algebra"]

docs_prob = [Set{Int64}() for ii = 1:length(problabels)]; #indices of documents in which any term of the group occurs
for ii = 1:length(problabels)
    for ng in probwords[ii]
        if contains(ng," ")
            for mention in crps2[ng]
                push!(docs_prob[ii],mention)
            end
        else
            for mention in crps[ng]
                push!(docs_prob[ii],mention)
            end
        end   
    end
end
numdocs_prob = [length(docs_prob[ii]) for ii = 1:length(problabels)]

problabels, numdocs_prob = sort_for_plot(problabels, numdocs_prob) #sort by frequency

fig = figure("problems_of_interest",figsize=(8,8))
b = barh(-1:-1:-length(problabels),100.*numdocs_prob/length(crps.documents),align="center",alpha=0.4)
yticks(-1:-1:-length(problabels),problabels)
title("Problems of Interest")
xlabel("Percentage of Documents")