User Attribute Inference via Mining User-Generated Data
by Shichang Ding
Date of oral examination: 2020-12-01
Supervisor: Prof. Dr. Xiaoming Fu
Referee: Prof. Dr. Xiaoming Fu
Referee: Prof. Dr. Marcus Baum
English
User attributes are a person's demographic characteristics, such as income, education, job, age, gender and socioeconomic status (SES). User attributes play an important role in many research areas, such as sociology and education. Recently, companies have become increasingly interested in user attributes because they are also valuable to many emerging applications, such as personalized recommendation, customized marketing and precise advertising. For example, previous works leverage users' age, gender and occupation to improve the performance of personalized recommendation. The traditional way to collect user attributes is the manual survey, which is expensive and time-consuming. Many researchers therefore try to infer user attributes from various kinds of user-generated data, such as people's tweets or cellphone records. Compared with surveys, these machine-learning-based user attribute inference (UAI) methods are much quicker and cheaper. However, many challenges remain open: introducing new kinds of user-generated data sources into attribute inference; improving the accuracy of multiple-attribute prediction based on limited data sources; and improving the performance of user-attribute-enhanced (UAE) tasks with UAI methods.

For the first challenge, SES inference based on human mobility data is chosen as a case study of introducing a new data source into UAI. The SES of a person or family reflects that entity's social and economic rank in society. This attribute can support applications such as bank loan decisions and provide measurable inputs for related studies such as social stratification, social welfare and business planning. Traditionally, national statistical institutes estimate SES for a large population through a large number of household interviews. Recently, researchers have begun to estimate individual-level SES from people's social media data.
However, these methods cannot work if researchers cannot obtain people's cyberspace data. We therefore need to keep introducing new data sources, especially widely recorded real-world user behavior such as human mobility. In this work, we leverage Smart Card Data (SCD) from public transport systems, which records the temporal and spatial mobility behavior of a large population of users. More specifically, we develop S2S, a deep learning-based method for estimating people's SES from their SCD. Essentially, S2S models two types of SES-related features, namely temporal-sequential features and general statistical features, and leverages deep learning for SES estimation. We evaluate our approach on a real dataset, the Shanghai subway SCD, which involves millions of users. The results show that the proposed method can use mobility data for SES inference and clearly outperforms several state-of-the-art methods in terms of various evaluation metrics.

For the next challenge, home-location-based inference of multiple socioeconomic attributes (SEAs) is selected as an example problem of improving the accuracy of multiple-attribute inference with limited input information. Inferring people's SEAs, including income, occupation and education level, is an important problem for applications such as personalized recommendation and targeted advertising. Some methods have been proposed to estimate SEAs when users have rich information, such as tweet content collected over a long period. However, the accuracy of these methods may suffer if researchers can only obtain limited information about users (e.g., no or very little tweet content). Besides, limited by budget and time, researchers may have to estimate as many attributes as possible from a limited data source. Multi-SEA inference based on limited information is even harder. Here we choose home location as an example of a limited data source.
The longitude and latitude of the home location are often used as a supporting data source in UAI work. The accuracy of existing methods will be seriously affected if only users' home locations are available. In this work, we try to predict a person's income level, family income level, occupation type and education level from his/her home location. We collect people's home locations and socioeconomic attributes through a survey covering 9 provinces and 85 cities in China. We then design new basic features by enriching home locations with knowledge from real estate websites, government statistics websites, online map services, etc. To learn a shared representation from the input features as well as attribute-specific representations for the different SEAs, we propose a multi-task learning method with an attention mechanism, called H2SEA. The factorization machine-based embedding component of H2SEA can also generate additional interaction features based on the basic input features. Extensive experimental results show that the proposed H2SEA model outperforms alternative models for SEA inference in terms of various evaluation metrics, such as AUC, F-measure and specificity.

The first two works focus on improving the performance of UAI itself in different scenarios. In the final work, we expand the focus to improving UAE tasks with the help of UAI. Two kinds of tasks rely on user attributes. User-attribute-based (UAB) tasks cannot be carried out without user attributes. For UAE tasks, attributes are not necessary but can be used to enhance performance. As the first two challenges show, designing an accurate UAI method requires a lot of work, including data mining and model design. UAE researchers would therefore usually rather give up the benefits of UAI to lower the cost, especially if the missing rates of attributes are too high or many kinds of attributes are missing.
In this thesis, we take collaborative filtering (CF) recommender systems as a case study of UAE tasks. CF recommendation methods mainly rely on historical user-item interactions and may therefore suffer from the interaction sparsity problem. Some algorithms have thus been proposed to leverage user/item attributes (e.g., user location or item brand) to enhance recommendation performance. However, in real-world datasets, user/item attributes are often missing for reasons such as privacy concerns. CF recommender systems usually use unknown tags or zeros as simple substitutes for missing attributes instead of leveraging UAI. In the final work, we first conduct empirical experiments to quantify how recommendation performance is affected if we use only simple substitutes for missing attributes. We then discuss how UAI can alleviate the negative impact of missing attributes. Although recommendation and UAI are usually studied separately, we argue that both can be seen as graph node representation learning tasks based on node interactions. We develop a novel multi-task Attribute-Enhanced Graph Convolutional Network (AEGCN) method, which enhances recommendation with auxiliary UAI tasks. The auxiliary attribute inference tasks can send estimated attribute information to the recommendation task, improving recommendation performance when attributes are incomplete. More specifically, we define recommendation and profiling on one user-item bipartite graph. The two kinds of tasks share one graph convolutional network (GCN) to learn the hidden user/item representations. The user/item representations are then used for profiling, while their combination is used to predict users' preferences for items. Extensive experimental results on three real-world datasets demonstrate that AEGCN is simple yet effective when attributes are missing.
Compared with attribute-enhanced CF models, AEGCN achieves comparable performance when the attributes are complete and significant improvements as the missing rate increases.

This thesis chooses mobility-based SES prediction, home-location-based SEA prediction and CF recommender systems as case studies of three open challenges of UAI. The three challenges studied in this thesis belong to a general effort to expand UAI from one-attribute prediction to multi-attribute prediction and finally to a multi-task framework that includes both UAI and UAE tasks.
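
The abstract describes recommendation and profiling as two tasks that share one set of user/item representations learned on a user-item bipartite graph. The following is a minimal sketch of that general idea only, not the AEGCN model from the thesis. It assumes a single weight-free propagation step with symmetric degree normalisation, a dot product as the preference score and a logistic head as the attribute predictor. All function names, the toy interactions and the head weights are hypothetical.

```python
from math import exp, sqrt


def propagate(interactions, user_emb, item_emb):
    """One weight-free propagation step with symmetric degree normalisation.

    Assumes every id in `interactions` has an entry in the embedding dicts.
    """
    dim = len(next(iter(user_emb.values())))
    user_deg = {u: 0 for u in user_emb}
    item_deg = {i: 0 for i in item_emb}
    for u, i in interactions:
        user_deg[u] += 1
        item_deg[i] += 1

    new_user = {u: [0.0] * dim for u in user_emb}
    new_item = {i: [0.0] * dim for i in item_emb}
    for u, i in interactions:
        # Each edge passes information in both directions.
        w = 1.0 / sqrt(user_deg[u] * item_deg[i])
        for k in range(dim):
            new_user[u][k] += w * item_emb[i][k]
            new_item[i][k] += w * user_emb[u][k]
    return new_user, new_item


def preference_score(user_vec, item_vec):
    """Recommendation head: combine a user and an item representation."""
    return sum(a * b for a, b in zip(user_vec, item_vec))


def attribute_probability(vec, weights, bias=0.0):
    """Profiling head: logistic prediction of a binary attribute."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + exp(-z))


# Hypothetical toy graph: two users, two items, three interactions.
interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i2")]
user_emb = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_emb = {"i1": [1.0, 0.0], "i2": [0.0, 1.0]}

# Both heads read the same propagated representations.
users, items = propagate(interactions, user_emb, item_emb)
score = preference_score(users["u2"], items["i2"])
prob = attribute_probability(users["u1"], weights=[1.0, -1.0])
```

In a trained multi-task model the two heads would be optimised jointly, so that gradients from the attribute inference task also shape the representations used for recommendation. The sketch omits training, learned weight matrices and multiple layers.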
Keywords: housing price; user attribute; attribute inference; data mining; socioeconomic status; income; human mobility; deep learning; subway; home location; multi-task learning; recommendation; graph neural network; collaborative filtering