arXiv is a prestigious repository of electronic preprints (known as e-prints) that are approved for posting after moderation. It hosts freely accessible scientific papers in the fields of mathematics, physics, astronomy, electrical engineering, computer science, quantitative biology, statistics, and quantitative finance.
After being selected for the European Data Incubator, we set out to solve the challenge published by JOT Internet Media: “[EDI-2018-4-JOT_2] Sentimental analysis in consumer journey”. Our approach was to predict user intent based on the keywords they use in search engines such as Google.
We started with a proof of concept back in September and improved it over the past 3 months. The entire process is thoroughly described in the paper that has been accepted and posted on arXiv: https://arxiv.org/abs/1812.07324
In essence, predicting user behavior on a website is a difficult task, which requires the integration of multiple sources of information, such as geo-location, user profile or web surfing history. In this paper we tackle the problem of predicting the user intent, based on the queries that were used to access a certain webpage. We make no additional assumptions, such as domain detection, device used or location, and only use the word information embedded in the given query. In order to build competitive classifiers, we label a small fraction of the EDI query intent prediction dataset, which is used as ground truth.
Then, using various rule-based approaches, we automatically label the rest of the dataset, train the classifiers and evaluate the quality of the automatic labeling on the ground truth dataset. We used both recurrent and convolutional networks as the models, while representing the words in the query with multiple embedding methods.
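To make the labeling step more concrete, here is a minimal Python sketch of rule-based labeling checked against a small hand-labeled set. It is an illustration only, not the rules from the paper: the intent labels, the trigger words in RULES and the helper names label_query and agreement are all hypothetical, and the classifier training itself is left out.

```python
from __future__ import annotations

# Hypothetical intent labels and trigger words, chosen for illustration.
# The actual rules and label set used in the paper are not reproduced here.
RULES: dict[str, set[str]] = {
    "transactional": {"buy", "cheap", "price", "order", "deal"},
    "informational": {"how", "what", "why", "guide", "tutorial"},
    "navigational": {"login", "website", "homepage", "official"},
}
DEFAULT_LABEL = "other"  # used when no rule matches any word in the query


def label_query(query: str) -> str:
    """Assign an intent label using only the words in the query."""
    tokens = query.lower().split()
    # Count how many tokens match each label's trigger words.
    scores = {
        label: sum(1 for token in tokens if token in triggers)
        for label, triggers in RULES.items()
    }
    # max() returns the first best label, so ties follow the order of RULES.
    best = max(scores, key=lambda label: scores[label])
    return best if scores[best] > 0 else DEFAULT_LABEL


def agreement(ground_truth: list[tuple[str, str]]) -> float:
    """Fraction of hand-labeled queries where the rules give the same label."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for query, label in ground_truth if label_query(query) == label)
    return hits / len(ground_truth)


if __name__ == "__main__":
    # A tiny, made-up ground truth set standing in for the manually labeled fraction.
    hand_labeled = [
        ("buy cheap shoes", "transactional"),
        ("how to tie shoes", "informational"),
        ("blue shoes", "other"),
    ]
    print(agreement(hand_labeled))
```

In a setup like this, the automatically labeled queries would serve as training data for the classifiers, while the hand-labeled queries stay aside to measure how trustworthy the automatic labels are.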
For future work, we will try to improve the labeling rules by looking beyond individual words and treating the query as a whole. We will also look into using various other features, such as parts of speech, chunking or domain classification. Another interesting idea is to retrain the word embeddings on the large EN dataset, and then use these embeddings with contexts formed not only from similar words, but also from the other columns (such as ID group, impressions or clicks).
This is an amazing result for us and a validation of our hard work. Special thanks to our colleagues Mihai Cristian Pîrvu, Alexandra Anghel and Alexandru Constantin for their contributions to this paper.
We’ll follow up early next year with an update on the MorphL Community Edition, where we’ll include the entire pipeline for classifying user intent based on search queries. Stay tuned for more news, ’cause 2019 is gonna be the bomb!