6.S194 Open Source Entrepreneurship
Units: 2-5-5
Prerequisites: 6.005, 6.006, preference given to 6.170 and 6.172 alums
Instructors: Professor Saman Amarasinghe (saman csail.mit.edu)
Co-Lecturer: Nick Meyer (entrepreneur-in-residence at the Trust Center)
TA: Jeffrey Bosboom (jbosboom csail.mit.edu)
Schedule: TR2:30-4, room E40-163
Enrollment limited to 30 students
Working in small teams, students will expand existing personal or MIT software systems from an unsupported "code dump" on GitHub to a full open-source project suitable for actual collaboration and use. We expect students to perform a substantial amount of programming to bring the project up to a usable status to support real use cases and the code quality to an industry standard. Beyond just programming, students will experience the full project management lifecycle, including requirements gathering, usability testing, documentation and technical writing, prioritization and risk management, project coordination, community support and outreach. Students are expected to take leadership roles in their project, supported by a graduate student, postdoc or faculty mentor to bootstrap project-specific knowledge. Course meetings will be split between studio sessions, during which teams can coordinate and work together, and lectures on software engineering and project management topics relevant to the projects.
Disciplined Entrepreneurship by Bill Aulet. MIT students have electronic access through the MIT libraries (click the "Get this at MIT" icon at the above link, then click a link in the table of contents).
Assignment 1: Mentor Interview, due Thursday, February 16.
Assignment 2: Find Users and Be a User, due Thursday, February 23.
Assignment 3: Minimum Viable Product, due Tuesday, March 14.
Assignment 4: Promotion Plan, due Tuesday, April 4.
Lecture 1 (project descriptions and course outline)
Lecture 4 (full life cycle use case; high-level specification)
Lecture 6 (risk management; minimum viable product). Worksheet
If you'd like to propose your own project, please e-mail Saman and Jeffrey (addresses above) with details.
Mentor: Tim Abbott
Description: Zulip is an open source group chat system. Technical projects include working on the React Native-based mobile app, the Electron-based desktop app, and the Snipe terminal/command-line IM client. Separately from the apps, an interactive chat bot framework is desired.
Zulip has a framework for doing user input validation within Django, which could be split out into a separate library that would be very competitive against django-rest-framework, the current state-of-the-art library for Django input validation.
Blue-sky technical ideas include federating Zulip with other protocols (e.g. Jabber or IRC), Dockerizing Zulip and making deployment easier, or adding voice and video chat or real-time collaborative editing.
On the user-experience and community-building side (but still somewhat technical), there should be an easy on-boarding path that allows open source project maintainers to quickly set up Zulip as a support/discussion system for their project, either hosted on zulipchat.com or self-hosted. This would involve some amount of GitHub integration work, as well as lots of community outreach to get people using it and fixing the inevitable small problems that will come up.
URL: https://zulip.org, https://github.com/zulip/zulip/
Notes: Zulip has a global community of dozens of contributors, but it has such a huge surface area as a project that there are major areas that could really benefit from a team taking ownership and making that area great.
Mentor: Frank Wang
Description: Attribute-based encryption is a recent cryptographic primitive that allows users to selectively decrypt certain ciphertexts based on their given access control permissions. This is extremely useful in medical settings where data is stored in a central database, but different employees, e.g. nurses, doctors, insurance companies, etc., have different access control policies. The hospital would like to give each employee just one key that will allow them to decrypt certain parts of the encrypted database. Although this is a useful tool and has gained a lot of traction in research, there does not exist a good open source library. Many older libraries exist, but they do not take advantage of modern advancements in hardware and cryptography techniques. The goal of this project is to implement such a library for the research community to use and benchmark it against previous implementations.
Notes: Cryptography background and strong programming skills are greatly preferred.
Mentor: Frank Wang
Description: Many times, data is stored encrypted on a server. However, a user's key might become compromised, or she might have given her key to a party that she no longer trusts. As a result, she needs to re-key her data. Downloading the data, decrypting it, and re-encrypting it is expensive. Ideally, she would like the server to re-key her data, but she doesn't want to trust the server too much, so any form of decryption on the server is not allowed. Instead, she can re-key her data using a proxy re-encryption scheme. Some basic libraries have been implemented, but they have not been open sourced and have not been thoroughly tested. The goal of this project is to open source such a library so that others can use it. There is already interest from industry in having such an open source library.
Notes: Strong cryptography background and programming skills are preferred.
Mentor: Frank Wang
Description: Function Secret Sharing (FSS) is a recent cryptographic primitive that allows a client to divide a function f into function shares f_1, f_2,...,f_k so that multiple parties can help evaluate f without learning the input. This is a powerful technique that allows users to do private querying and anonymous communication with very little bandwidth overhead. Currently, there are no good open source libraries to do this. The goal of the project would be to create a library with a clean API so that other projects can build upon it. There is a lot of interest around this in the research community, and it would be heavily used.
URL: https://github.com/frankw2/libfss
Notes: Strong cryptography and programming background are preferred.
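To make the idea of function shares concrete, here is a minimal sketch in Python. It is not the construction used in libfss, and it is not real FSS: it naively splits the full truth table of a point function into two additive shares, so each share is as large as the domain, whereas practical FSS schemes produce compact keys. All names (share_point_function, answer_count_query, MODULUS) are hypothetical and chosen for illustration only.

```python
import secrets

MODULUS = 2**32  # assumption: shares are added modulo 2^32


def share_point_function(alpha, beta, domain_size):
    """Split f(x) = beta if x == alpha else 0 into two additive shares.

    Naive illustration: each share is a full table over the domain.
    """
    if not 0 <= alpha < domain_size:
        raise ValueError("alpha must lie inside the domain")
    # The first share is uniformly random, so on its own it reveals nothing.
    share_1 = [secrets.randbelow(MODULUS) for _ in range(domain_size)]
    # The second share is chosen so that share_1[x] + share_2[x] == f(x) mod MODULUS.
    share_2 = [
        ((beta if x == alpha else 0) - share_1[x]) % MODULUS
        for x in range(domain_size)
    ]
    return share_1, share_2


def answer_count_query(share, records):
    """Run by one server: sum its share of f over every record it stores."""
    return sum(share[r] for r in records) % MODULUS


def reconstruct(partial_answers):
    """Run by the client: add the servers' partial answers."""
    return sum(partial_answers) % MODULUS


if __name__ == "__main__":
    # The client privately counts how many records equal 3.
    f_1, f_2 = share_point_function(alpha=3, beta=1, domain_size=8)
    records = [3, 5, 3, 1, 3]  # the same data is held by both servers
    total = reconstruct([
        answer_count_query(f_1, records),
        answer_count_query(f_2, records),
    ])
    print(total)
```

Each server sees only a table of random-looking numbers, yet the two partial answers add up to the count the client asked for. A real library would replace the full tables with short keys that expand on demand.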
Mentor: Raul Castro Fernandez
Description: Organizations face a data discovery problem when their analysts spend more time finding relevant data than analyzing it. This problem has become common as: i) data is stored across multiple storage systems, from databases to data lakes; ii) data scientists do not operate within the limits of well-defined schemas; instead, they want to find data across their organization to answer increasingly complex business questions, such as building machine learning models or finding evidence for hypotheses.
We have built Aurum, the first system to perform data discovery at large scale (MIT-licensed). It approaches the problem through three differentiated layers. The first, a high-performance profiler built in Java, summarizes terabytes of data into profiles that contain signatures representing that data. The second, a graph builder built in Python, mines all existing relations between profiles through efficient techniques based on minhash signatures and locality-sensitive hashing, and represents them in a hypergraph. Finally, a third layer exposes a discovery algebra, called SRQL, that permits users to declare discovery queries. Aurum is in use by different companies in the area, including a big pharma company and a data integration company, as well as by teams within MIT organizations, and it is to be deployed within the analytics infrastructure team of the city of NYC. There are a multitude of projects at each of the different layers.
URL: https://github.com/mitdbg/aurum-datadiscovery
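For readers unfamiliar with the minhash and locality-sensitive hashing techniques mentioned in the Aurum description, the sketch below shows the general idea in Python. It is not Aurum's code: the function names, the number of hash functions, and the banding parameters are all assumptions made for illustration. Each column is reduced to a short signature, the fraction of matching signature positions estimates the Jaccard similarity of two columns, and banding the signature yields bucket keys so that similar columns can be found without comparing every pair.

```python
import hashlib


def normalise(values):
    """Treat a column as the set of its distinct, normalised string values."""
    return {str(v).strip().lower() for v in values}


def _hash(seed, value):
    # One member of a family of hash functions, selected by the seed (used as salt).
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=64):
    """Keep, for each hash function, the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates |A & B| / |A | B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_buckets(signature, bands=16):
    """Split the signature into bands; two columns that agree on a whole band
    share a bucket key and become a candidate pair."""
    rows = len(signature) // bands
    return {
        (band, tuple(signature[band * rows:(band + 1) * rows]))
        for band in range(bands)
    }


if __name__ == "__main__":
    cities_a = normalise(["Boston", "Cambridge", "Somerville", "Medford"])
    cities_b = normalise(["boston", "cambridge", "somerville", "Quincy"])
    sig_a, sig_b = minhash_signature(cities_a), minhash_signature(cities_b)
    print("estimated Jaccard:", estimate_jaccard(sig_a, sig_b))
    print("candidate pair:", bool(lsh_buckets(sig_a) & lsh_buckets(sig_b)))
```

More hash functions give a tighter similarity estimate at the cost of larger signatures, while the number of bands trades recall against the number of false candidate pairs.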
Mentor: Anish Athalye
Description: Gavel is a judging system written by the HackMIT team designed to fully automate judging at project expos. The system incorporates research on mathematical psychology along with fancy math to produce fair judging results. Gavel has gotten pretty good traction: it's already been used at over 20 high-profile events like HackMIT, HackPrinceton, and WildHacks.
The software is barely at v1.0, so there's still a lot of interesting technical work to be done – the issues page lists some ideas for new features, and there's also a lot of room for creativity and for larger-scale features that could be added to make Gavel a better, more flexible, and more usable system. There's also a lot of work that could be done on the dev evangelism / entrepreneurship side of things, like working to make the documentation friendlier, convincing people to use the system, helping people get set up with it, and finding applications beyond just hackathons.
In case anyone is interested in the details of how the system came to be, we've written a couple blog posts about the motivation for the judging model, the math behind the current implementation, and the v1.0 release of the software.
URL: https://github.com/anishathalye/gavel
Notes: It would be awesome if we can find creative team members who can come up with their own ideas for how to make the project better (rather than only implementing the mentor's ideas).
It would also be great to find people who are also interested in the dev evangelist side of things in addition to the technical work, because that's definitely something Gavel needs right now.
Mentor: Reja Amatya
Description: More than 1.2 billion people around the world live without any electricity, and an even greater population does not have access to reliable electricity. This energy poverty results in a lower quality of life and is a significant hurdle in the economic development of these areas.
With a team of graduate students (EECS), we have developed hardware and software at MIT that can create a microgrid network with distributed resources with both generation and demand management capabilities. We have developed code for effective communication between hardware units and for demand management and operation. Currently, most of our code base is generic: we have a communication protocol using standard RS-485 MODBUS communication between micro-controller chips, and SPI communication between our power module and communication module. We also have code for communication between a GSM chip and the 2G network. As part of the class project, students will take this code and work with it. A big benefit we see in making this an open source project is the external contribution we can get for making the code robust, as well as more ideas for demand management and optimization, but there is also the potential use of such generic algorithms for other applications beyond microgrid operation.
Notes: EE background will be helpful. A functioning prototype is ready to be used by students if needed.
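As a rough illustration of the kind of generic communication code mentioned in the microgrid description, the Python sketch below builds and checks a Modbus frame. It assumes the RTU framing commonly used over RS-485 (the description does not say which Modbus variant the project uses), and it is not the project's code; the function names are hypothetical. The CRC routine follows the standard CRC-16/MODBUS parameters (initial value 0xFFFF, reflected polynomial 0xA001, low byte transmitted first).

```python
import struct


def crc16_modbus(data: bytes) -> int:
    """Compute the CRC-16/MODBUS checksum of a byte string."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def build_read_holding_registers(unit_id: int, start_address: int, count: int) -> bytes:
    """Build a request frame for function code 0x03 (read holding registers)."""
    body = struct.pack(">BBHH", unit_id, 0x03, start_address, count)
    # The CRC is appended low byte first, unlike the big-endian data fields.
    return body + struct.pack("<H", crc16_modbus(body))


def frame_is_valid(frame: bytes) -> bool:
    """Check that the trailing two bytes match the CRC of the rest of the frame."""
    if len(frame) < 4:
        return False
    body = frame[:-2]
    (received_crc,) = struct.unpack("<H", frame[-2:])
    return crc16_modbus(body) == received_crc


if __name__ == "__main__":
    request = build_read_holding_registers(unit_id=1, start_address=0, count=10)
    print(request.hex(" "))
    print(frame_is_valid(request))
```

On a micro-controller the same logic would typically be written in C, but the framing and checksum rules are identical.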
Mentor: Erik Hemberg
Description: The MOOCdb project aims to bring together education researchers, computer science researchers, machine learning researchers, technologists, and database and big data experts to advance MOOC data science. The project, founded at MIT, includes a platform-agnostic functional data model for data exhaust from MOOCs, a collaborative-open source-open access data visualization framework, a crowd-sourced knowledge discovery framework, and a privacy-preserving software framework. The team is currently working to release a number of these tools and frameworks as open source.
URL: http://moocdb.csail.mit.edu/, https://github.mit.edu/ALFA-MOOCdb-6S194
Mentor: Ed Doddridge
Description: Oceanographers use numerical models to test our understanding of the ocean and the climate system. These models can be huge climate models with atmospheres, sea ice, and oceans that take months to run on huge supercomputers, like those featured in the IPCC reports; they can be simple models that run quickly on a single computer; or they can be somewhere in between. Many researchers write their own simple models and describe, in general terms, the numerical choices they made when solving the equations. Unfortunately, the source code is rarely made publicly available. This makes it difficult to reproduce their results, and leads to poorly documented code being handed around through ad-hoc channels. This project is intended to provide the community with a simple model that is well documented and easily configured so that research findings may be more easily replicated and tested.
I am using this model with some collaborators to do idealised modelling of the ocean in the Arctic. It's a very idealised and algorithmically simple model that is intended to be cheap enough to run on a normal computer, but complex enough for research - provided one is asking the right questions.
The source code for the model exists along with a few examples, but it lacks tests, and the documentation needs quite a lot of work. There are a few extensions I would one day like to add, for example, extending the model to work in periodic domains and using a faster algorithm for the matrix inversion.
URL: github.com/edoddridge/MIM
Mentor: Fredrik Kjolstad
Description: Linear and tensor algebra are the building blocks of modern optimization, simulation, machine learning, and data analytics applications. Hundreds of libraries exist for linear algebra, and recently libraries like Google's TensorFlow provide dense tensor algebra, but support for sparse tensor algebra is lacking. The Tensor Algebra Compiler (taco) makes fast and portable sparse and dense tensor and linear algebra possible. We have good research code written in C++ that promises unprecedented flexibility, performance, and portability. With your help we will make taco the default library for these domains.
Mentor: Tristan Naumann and Anna Rumshisky
Description: Clinical notes written by doctors about patients contain information about disease progression, treatment efficacy, and patient outcomes. The ability to reliably extract and identify patterns among these data could lead to important breakthroughs in clinical treatment options. Want your work to make a difference in healthcare?
Clinical Named Entity Recognition system (CliNER) is an open-source natural language processing system for named entity recognition in the clinical text of electronic health records. Specifically, CliNER is designed to follow best practices in clinical concept extraction to identify clinically-relevant entities mentioned in a clinical narrative, such as diseases/disorders, signs/symptoms, medications, procedures, etc.
Its role as a component in clinical pipelines lends itself well to user-facing improvements, such as API development (currently it is primarily used as a command-line tool), identification of the best packaging option (currently several are provided), and charting a roadmap for folding in improvements (currently old methods are removed to introduce new methods). Improvements related to online presence are also appreciated and, of course, there's always plenty of additional research that could be implemented depending on interest!
URL: http://text-machine.cs.uml.edu/cliner/, https://github.com/text-machine-lab/cliner
Notes: This project sits somewhere between a research prototype and a tool with real users, having drawn user attention following its introduction at AMIA CRI. With its emphasis on natural language processing, members would benefit from prior experience with NLP and/or ML; however, we've previously worked with undergraduates with no prior experience but sufficient interest. Similarly, the project is written in Python and would therefore benefit from members with Python experience.
Mentor: David Karger
Description: Tipsy (http://tipsy.news/) is a Chrome browser extension that delivers voluntary microdonations from generous consumers for content they read on news sites and other publishers on the web. It's related to Flattr and Patreon but has some significant advantages. Tipsy was developed by a professional engineer but is now in our hands to take further. This involves a combination of marketing/evangelism and development. The goal of recruiting "customers" among publishers and users is obvious. There are also a variety of improvements we wish to develop to the tool.
Right now, payments are delivered via Paypal and Dwolla, which have transaction fees and annoying interfaces that make Tipsy less effective. So I'm seeking a student interested in adding the ability to pay using Bitcoin. Another direction in which I'd like to extend Tipsy is to apply it to help support creative commons content that can be embedded in many pages. Tipsy also has, hiding inside it, a powerful "superhistory" component for detailed tracking of user activity (in a completely private way). I'd like to factor out this component because it could be useful for many other tools. There are also some very interesting opportunities to explore Tipsy as a platform that can help mediate between users and publishers. For example, Tipsy could carry a variety of demographic and preference information, and use that in coordination with the web site to customize the user's experience even on the first visit.
URL: http://tipsy.news/, http://github.com/haystack/tipsy
Notes: This Chrome extension was built by a professional developer who is no longer accessible. So we're going to need to do some work figuring out the build process and such but it should be doable. Particularly useful background is experience in Chrome extension development (more generally javascript) and bitcoin APIs.
Mentor: David Karger
Description: NB (http://nb.mit.edu/) is our research system that provides online discussion of course lecture notes, videos, and other material in the margins of that material so you can easily find and contribute to the right discussions. It's being used by ten thousand students at over 100 universities around the world to produce over 1 million comments. I am seeking web designers, software engineers, and researchers to help us maintain, extend, and study the NB system.
For students interested in back-end software engineering or front-end web application design, NB is an old, crufty system (Django back end + JavaScript front end) that has grown haphazardly over many years and needs significant redesign and re-engineering to use modern themes, components, libraries, and UI designs.
For students interested in online education and research, NB provides a platform with real users and data for research about how online discussion can help people learn. There's a broad collection of features that have been requested by users and others we've thought of ourselves, that need to be implemented (inside the existing crufty code-base), deployed, studied, and evaluated.
Interested students should have completed at least 6.005 or equivalent and have at least some familiarity with Python, Javascript, and collaborative software engineering.
URL: http://nb.mit.edu/, http://github.com/nbproject/nbproject
Notes: This is a "build the plane while you fly it" project---we have a large user base we have to keep happy; at the same time I think the system basically needs a rebuild from scratch.
Mentor: Manasi Vartak
Description: Companies have embraced the power of AI (deep and shallow learning alike!) and have made machine learning a core part of their products. However, infrastructure and tools to support the ML lifecycle are sorely missing. The ModelDB project addresses one of the key pain points: the lack of a system to manage the hundreds of machine learning models that are built every day.
ModelDB is an end-to-end system to track machine learning models as they are built, store models and metadata in a centralized fashion, and capture metadata and metrics to support query, analysis and reuse. It enables ML teams to version their models, make them reproducible and easy to share, compare and analyze.
ModelDB is the first model management system out there and is currently being tested at a handful of companies. We've gotten really positive feedback so far and the system will be open sourced (MIT license) in early February. We are expecting to get a large number of users after release and there are several directions the project could go, depending on the team.
This project would be a great opportunity to: (i) build infrastructure that will be used across a lot of companies, (ii) learn about machine learning, (iii) work with users from different domains (e.g. banks, startups, social media companies) to build out impactful features, (iv) join and shape a nascent open-source community.
Some potential projects: (i) building a ModelDB client for a new language or environment (e.g. R, lua, tensorflow), (ii) adding advanced features to ModelDB depending on feature requests and new requirements (e.g. in java, scala, python), (iii) adding visualization capabilities to the frontend to support more flexible querying of data, (iv) supporting online model updates.
Who should consider the project? You've done software development projects in the past (backend or frontend, equivalent of 6.170, 6.814, 6.824). You are interested in improving your skills by working on a real-world system. You are excited to learn about machine learning and build systems to support it. You are willing to help shape a small open-source community.
URL: https://mitdbg.github.io/modeldb/
Notes: ModelDB is currently being tested by a handful of companies and we've gotten great feedback from them. There are 20+ other companies who have also expressed interest in using ModelDB. We will be open sourcing (MIT license) the current version of ModelDB at the Spark Summit next week. We expect a lot of developers to try out ModelDB and also to request tons of features. We are looking for students to continue work on ModelDB (two of the key contributors are graduating at the end of IAP) along with the two remaining team members (Manasi included). There are a lot of interesting directions for the project, some listed in the description. We are looking for students who have done software engineering projects before (particularly with java/scala/python for the backend and web projects for the frontend). There is also room to work in R in order to develop a brand-new R client, potentially in collaboration with one of the larger companies.