What is it?
The objective is to make math searchable and to help people find math content on the web more easily and accurately.
What problem is it trying to solve?
Students learn math by example, observing how a formula or equation is used in a variety of places. Through this, they gain perspective and a deeper understanding of the subject. However, it is difficult for them to collect this information outside the scope of textbooks: existing search engines mostly provide text-based search, which makes math-related queries difficult to express.
Academic researchers, on the other hand, develop their knowledge and base their work on the existing literature. Academic publications contain a lot of structured math content that is valuable but difficult to index and search. There are also interdependencies among these pieces of knowledge that, until now, could only be identified through a reader’s experience and memory.
On a broader scope, the World Wide Web is growing exponentially while, in the meantime, the quality of existing search engines is continuously degrading. The inherent issue is the noise introduced by indexing massive amounts of uncategorized, hard-to-filter content from heterogeneous sources. In reality, however, people rely on these search engines as an essential learning tool on a daily basis. We believe that this is not the most effective or efficient way to self-educate and discover knowledge.
The above problems are essentially the same: the lack of a search tool that lets people express structured content such as math, and the lack of ways to index, organize and present this knowledge with good quality.
AskMath will provide a WYSIWYG web app that allows users to enter math formulas in their natural written form. On the backend, the search applies heuristic algorithms to convert a presentational form of the math (e.g. MathML) into its unambiguous semantic counterpart, and then searches an indexed database of mathematical corpora to identify the documents that are semantically similar to the user’s query.
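To make the presentation-to-semantics step more concrete, here is a minimal sketch in Python. It is not the AskMath implementation: the function name `to_semantic` and the output vocabulary (`power`, `divide`, `root`) are assumptions chosen for illustration, and only a handful of presentation MathML elements are handled.

```python
import xml.etree.ElementTree as ET


def to_semantic(node):
    """Convert a presentation MathML element into a prefix-style string.

    Hypothetical helper: the operator names below are illustrative only.
    """
    tag = node.tag.split("}")[-1]  # drop an XML namespace prefix if present
    children = list(node)
    if tag in ("mi", "mn", "mo"):
        # Leaf tokens: identifiers, numbers and operators.
        return (node.text or "").strip()
    if tag == "msup" and len(children) == 2:
        base, exponent = children
        return f"power({to_semantic(base)}, {to_semantic(exponent)})"
    if tag == "mfrac" and len(children) == 2:
        numerator, denominator = children
        return f"divide({to_semantic(numerator)}, {to_semantic(denominator)})"
    if tag == "msqrt":
        inner = " ".join(to_semantic(child) for child in children)
        return f"root({inner})"
    # Containers such as <math> and <mrow>, and anything not handled above:
    # fall back to joining the converted children in document order.
    return " ".join(to_semantic(child) for child in children)


mathml = (
    "<math><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo>"
    "<mfrac><mn>1</mn><mi>y</mi></mfrac></math>"
)
print(to_semantic(ET.fromstring(mathml)))  # power(x, 2) + divide(1, y)
```

A real converter would also have to resolve ambiguities that this sketch ignores, for example whether a superscript denotes a power or an index.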
AskMath started as a side project that I have been working on with my undergrad (GDUT) friends in Guangzhou, China since 2008. The motivation is simple: we looked at the online education market in China and saw a gap between what students want to learn outside the classroom and the quality of the tools available for doing so. Especially in science and engineering, where mathematical content is central, it is not easy for people to collect, organize and search for information using existing search engines. Having had 2+ years of experience working in a start-up (Btwxt Games) and some 5 years of academic research practice in the field of empirical algorithm design, I decided to give it a try and see what it would become.
- Index DB: inverted index database of the mathematical corpus, conforming to the Apache Lucene index format
- Redis DB: stores the original raw document content, which can be retrieved by unique document ID
- Lucene Indexer: a heterogeneous parser that consumes existing mathematical content (LaTeX source files, PDF, HTML, etc.) and converts it into the Lucene index DB
- Search Server: the search engine core, which parses the user query and retrieves semantically similar content from the Index DB
- Web Server: serves the WYSIWYG web app to the user
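The split between an inverted index and a separate document store can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: the class `TinyIndex` is hypothetical, plain Python dictionaries stand in for the Lucene index and for Redis, and whitespace tokenization replaces real math-aware analysis.

```python
from collections import defaultdict


class TinyIndex:
    """Toy stand-in for the Index DB plus the raw document store."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document IDs
        self.documents = {}               # document ID -> raw content

    def add(self, doc_id, text):
        """Store the raw text and index every whitespace-separated token."""
        self.documents[doc_id] = text
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return (doc_id, raw text) for documents containing all query tokens."""
        tokens = query.lower().split()
        if not tokens:
            return []
        # .get avoids creating empty postings for unknown tokens.
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [(doc_id, self.documents[doc_id]) for doc_id in sorted(matches)]


index = TinyIndex()
index.add("doc1", "Pythagorean theorem: a^2 + b^2 = c^2")
index.add("doc2", "Difference of squares: a^2 - b^2 = (a + b)(a - b)")
print(index.search("c^2"))  # [('doc1', 'Pythagorean theorem: a^2 + b^2 = c^2')]
```

The search step only touches the postings. The raw content is looked up by document ID afterwards, which mirrors the division of labour between the Index DB and the Redis DB described above.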
I implemented a barebones proof-of-concept prototype to demonstrate the idea. It allows you to enter simple math formulas and query them against a sample database of LaTeX documents. A few implementation details of the front-end are:
- HTML layout using Bootstrap
- math rendering with MathJax
- math editor is a modification of MathJax that enables interactive editing of parts of a math formula. Keyboard navigation (arrow keys) is enabled for easier movement within the formula.
- socket.io and DNode are used for communication between the client app and the search server.
The search server and indexer are implemented in Node.js with two native C++ module extensions:
- CLucene for content indexing and querying, conforming to the Apache Lucene format
- Tralics library for parsing LaTeX source to XML
Express.js, DNode and socket.io are used for communication with the front-end web app.
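Before a formula can be indexed, it has to be pulled out of the LaTeX source. The sketch below shows that idea with a regular expression. It is a much simpler stand-in for a full LaTeX-to-XML translator such as Tralics, not a description of how Tralics works, and the names `MATH_PATTERN` and `extract_formulas` are hypothetical.

```python
import re

# Matches three common math delimiters; escaped dollars (\$) and math
# environments such as equation or align are deliberately not handled.
MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"    # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"   # display math: \[ ... \]
    r"|\$(.+?)\$",      # inline math:  $ ... $
    re.DOTALL,
)


def extract_formulas(latex_source):
    """Return the bodies of all math spans found in a LaTeX string."""
    formulas = []
    for match in MATH_PATTERN.finditer(latex_source):
        # Exactly one of the three groups is set for any given match.
        body = next(group for group in match.groups() if group is not None)
        formulas.append(body.strip())
    return formulas


source = r"Euler showed that $e^{i\pi} + 1 = 0$. Also \[ a^2 + b^2 = c^2 \] holds."
print(extract_formulas(source))
```

The extracted strings would then be handed to the indexer, while the surrounding text could be indexed as ordinary content.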
A lot still needs to be done.
- Scoring Function: takes two math formulas and calculates a similarity value between them. This would be implemented as an add-on to Apache Lucene’s scoring framework and would ultimately drive math indexing and search queries.
- Java-based Search Server: the CLucene project is outdated and hasn’t been kept in sync with the official Apache Lucene Core. To integrate with the Java-based Lucene Core while maintaining a high-performance web service platform, Netty.io would be a good choice.
- Content Parsing and Aggregation: covering existing online repositories of academic publications, as well as MathML content on the web.
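As an illustration of what a scoring function for two formulas could look like, here is one possible sketch: Jaccard similarity over the sets of sub-expressions of two expression trees. This is an assumed technique chosen for its simplicity, not the planned AskMath scoring function, and it is independent of Lucene’s scoring framework. The nested-tuple tree format and the names `subexpressions` and `similarity` are hypothetical.

```python
def subexpressions(expr):
    """Collect every sub-expression of a nested-tuple expression tree.

    A tree is either a leaf string ("x", "2") or a tuple whose first item
    is an operator name and whose remaining items are operand trees.
    """
    found = {expr}
    if isinstance(expr, tuple):
        for operand in expr[1:]:  # expr[0] is the operator name
            found |= subexpressions(operand)
    return found


def similarity(a, b):
    """Jaccard similarity of the sub-expression sets, from 0.0 to 1.0."""
    set_a, set_b = subexpressions(a), subexpressions(b)
    return len(set_a & set_b) / len(set_a | set_b)


# x^2 + y^2 compared with x^2 + 1
f1 = ("plus", ("power", "x", "2"), ("power", "y", "2"))
f2 = ("plus", ("power", "x", "2"), "1")
print(similarity(f1, f2))  # 0.375
print(similarity(f1, f1))  # 1.0
```

The two formulas share the sub-expressions `x^2`, `x` and `2` out of eight distinct sub-expressions in total, hence 3/8. A production scoring function would likely weight larger shared sub-trees more heavily and treat renamed variables as equivalent.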