* For the sole purpose of studying Machine Learning (ML), of course – but not without a slightly unhappy scowl from my beloved wife.
In the online dating world, there is likely no easier application than Tinder. One finger and a few neurons are all you need to swipe and choose the girls you like best. The popular app provides a simple, hassle-free way to get closer to meeting your better half.
Given its ease of use, I decided that Tinder would be the ideal candidate for a little experimentation. It was time to try out a little Machine Learning (ML) on my new GPU, with Tinder as my subject. The only thing left to do was assure my wife that I wasn’t perusing the internet in search of a new partner – I was just conducting an experiment in training neural networks.
What is the main problem with dating sites?
Adult online dating behemoth Ashley Madison runs the slogan, “Life is short. Have an affair.” The site caters mainly to married men looking to have a saucy side affair.
Ashley Madison’s business model is very interesting. In addition to the standard scheme of buying points and spending them to like and write to potential mistresses, the site asks for $19 to delete a user account: discretion at a cost.
In 2015, the site was hacked and 60 GB of personal data was leaked. The privacy and secrecy that users had come to expect from Ashley Madison as a leading adult dating site suddenly went belly up, as thousands found their key personal details exposed to the public eye.
Beyond the obvious plethora of happy lawyers and destroyed families, the leak yielded a wealth of interesting information to analyze.
I had always suspected that Ashley Madison’s client base would be mostly male – that men would outnumber women in their search for discreet online affairs. However, the facts were shocking.
When analyzing the leaked data, American journalist Annalee Newitz found that of the roughly 5.5 million female accounts on Ashley Madison, only about 12,000 appeared to belong to real women who regularly used the site. As for the rest? Just bots with pretty photos.
Similar situations are not uncommon on other dating sites. Male accounts vastly outnumber female ones, and the simple economic precepts of supply and demand apply.
For men, these figures present an undoubtedly unfair situation. A growing number of men report outright frustration: they make great efforts, only to be met with unresponsive women or, worse yet, with accounts that no real woman is behind. The odds are clearly in women’s favor.
The unique Tinder approach.
Tinder’s greatest draw for users is the low effort required to get a match. All it takes is two coinciding swipes to the right and you’re connected with a potential partner.
Because of the gender imbalance, though, most women get dozens of matches per day while men aren’t so lucky. A woman dealing with an influx of candidates is less likely to pay attention to any one individual, such as you.
It is easy to see that the Tinder platform operates largely on a superficial basis. Users get very few opportunities to learn anything deeper about potential partners and rely instead on visual snapshots. A photograph of a woman posing in a swimsuit or driving a supercar hardly provides insight into her personality, aspirations or personal preferences.
The pressure of finding a match rests almost entirely on photo quality. If your photos are anything less than captivating, eventually you will have no choice but to increase your chances by adopting the r-selection strategy. Simply stated: casting a wider net increases the chance of success.
Since it is impossible to spend every waking moment in search of a partner, and because swipes are limited, some might opt for automated matches. Just create an ML model based on your preferences – for example, short redheads or tall brunettes – and let the app do the work for you while you go about your day.
Firstly, more data will produce better accuracy. Anyone who has come across Machine Learning knows how difficult it is to collect and correctly assemble a dataset. Theoretically, any similar resource is suitable as a data source, be it Instagram or other social networks, but it’s always best to train the model on the same kind of samples the network will work with in the future.
I will use the Tinder Automation repository as the basis. Tinder’s photos are always publicly available, but the “likes” feature is limited to members only. Therefore, it is necessary to extract all available data and carefully label it. To do this, I used a fairly simple script.
This script lets you quickly label the dataset with just two buttons. The key pitfall: newer releases of the “Werkzeug” library broke backward compatibility, so a forced downgrade is needed. Otherwise, the script errors out.
To do so, set “Werkzeug == 0.16.1” in “requirements.txt”. Then the script will run successfully.
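The labeling script itself is not reproduced here, so the following is only a hypothetical stand-in for the step behind the two buttons: each decision moves a photo into a "like" or "dislike" folder. The function names and the folder layout are assumptions made for illustration, not the repository's actual code.

```python
import shutil
from pathlib import Path


def label_photo(photo: Path, liked: bool, dataset_root: Path) -> Path:
    """Move one photo into dataset_root/like or dataset_root/dislike."""
    # Hypothetical layout: one sub-folder per class.
    target_dir = dataset_root / ("like" if liked else "dislike")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / photo.name
    shutil.move(str(photo), str(target))
    return target


def count_labels(dataset_root: Path) -> dict:
    """Count how many photos have been sorted into each class so far."""
    return {
        name: sum(1 for p in (dataset_root / name).glob("*") if p.is_file())
        for name in ("like", "dislike")
        if (dataset_root / name).is_dir()
    }
```

Keeping the classes in separate folders makes it easy to check the balance of the dataset at any point while labeling.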
The second step is to get the token. The standard method from the repository didn’t work for me, but I managed to get it from the browser’s developer console.
To do this, follow this link: http://www.facebook.com/v2.6/dialog/oauth/confirm?dpr=1 and pull out the response to the POST request. Inside the response, search for “access_token”. Take the token and hard-code it into the script.
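If you would rather not hunt for the token by eye, a small helper can pull it out of the copied response text. This is a minimal sketch that assumes the body contains a fragment of the form `access_token=<value>` terminated by `&`, a quote, a backslash or whitespace; the exact response format is an assumption and may differ.

```python
import re
from typing import Optional

# Assumed shape: ...access_token=<value>&expires_in=...
_TOKEN_PATTERN = re.compile(r"access_token=([^&\"'\\\s]+)")


def extract_access_token(response_body: str) -> Optional[str]:
    """Return the first access_token value found in the text, or None."""
    match = _TOKEN_PATTERN.search(response_body)
    return match.group(1) if match else None
```

Paste the response body into a string, call `extract_access_token` on it, and use the returned value as the token.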
There are several key requirements for Machine Learning datasets:
- Adequate amount of data
- Variety
- Consistency
Adequate amount of data
At least 10,000 photographs are required to build an adequate model. Yes, that is a lot! That’s why there are services like Mechanical Turk, where for a fee you can delegate the labeling of your dataset to other people. On the other hand, this is your bot and it should reflect your personal tastes in women.
In this case there is no problem with variety: the photos come in a wide range of perspectives and lighting, showing women with glasses, wearing dresses, posing in swimsuits, etc. However, there is a problem with the consistency of the dataset.
Ideally, when we label our sample, the two classes should be approximately equal in size (the dataset needs to contain more than just samples of pretty women, otherwise it will be skewed). If you get a “skewed” dataset, you will have to dilute it with photos from other sources. Whether you need to add more attractive photos or, conversely, less attractive ones, you will determine from the result of the labeling.
My results were around ‘62% pretty’ for my taste, suggesting either that I’m not too picky or I just got lucky because there are a lot of pretty girls around.
I also suspect that many of them are bot accounts. We train a bot that will like other bots – how ironic is that.
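Coming back to the balance of the dataset: the share of each class is easy to check in code. The sketch below also shows downsampling the larger class, which is a simpler alternative to diluting the dataset with photos from other sources and not the approach described above. The function names are illustrative assumptions.

```python
import random
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that falls into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def downsample_to_balance(samples, seed=0):
    """Trim every class to the size of the smallest one.

    samples is a list of (item, label) pairs.
    """
    by_label = {}
    for item, label in samples:
        by_label.setdefault(label, []).append((item, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

Downsampling throws data away, so with a small dataset adding photos of the rarer class is usually the better option.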
Processing the Data
Now we have a bunch of labeled photos, but they are far from uniform: they range from day to night, color to black and white, even shots from the back.
By this point it became apparent that ML training with these types of photographs would not work. Therefore, the best option would be to use just the faces.
For this task, I used a cascade classifier based on Haar-like features. This is a great algorithm that allows you to find faces in images with a low percentage of false-positive errors.
You can read more about this in the OpenCV documentation.
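As a rough illustration of the face extraction step, here is a minimal sketch using OpenCV's bundled frontal-face Haar cascade. It assumes the `opencv-python` package, which ships the cascade files under `cv2.data.haarcascades`; the 64-pixel output size, the detection parameters and the "keep the largest face" rule are arbitrary choices for illustration, not the settings used in this experiment.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), as returned by OpenCV


def largest_box(boxes: List[Box]) -> Optional[Box]:
    """Pick the detection with the largest area, or None if nothing was found."""
    if len(boxes) == 0:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])


def crop_face(image_path: str, output_path: str, size: int = 64) -> bool:
    """Detect the largest face in a photo and save it as a size x size crop."""
    import cv2  # imported lazily so largest_box works without OpenCV installed

    image = cv2.imread(image_path)
    if image is None:  # unreadable or missing file
        return False
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    box = largest_box([tuple(int(v) for v in d) for d in detections])
    if box is None:  # no face found, skip this photo
        return False
    x, y, w, h = box
    face = cv2.resize(image[y:y + h, x:x + w], (size, size))
    return bool(cv2.imwrite(output_path, face))
```

Photos for which `crop_face` returns `False` (no face detected) can simply be dropped from the dataset.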
At this next stage, when only faces appear in the sample, it makes sense to remove the color.
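Removing the color boils down to a weighted sum of the three channels. The sketch below uses the standard luma weights (0.299, 0.587, 0.114), the same ones OpenCV applies in its color-to-gray conversion, on plain Python tuples in R, G, B order. In practice a single `cv2.cvtColor` call does the same job much faster; this version only shows what happens to each pixel.

```python
def to_gray(pixel):
    """Convert one (r, g, b) pixel to a single 0-255 gray level."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def image_to_gray(rows):
    """Convert an image given as rows of (r, g, b) tuples to rows of gray levels."""
    return [[to_gray(pixel) for pixel in row] for row in rows]
```

Going from three channels to one also cuts the input size of the network to a third.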
Let’s build the model
Keep in mind that without a good video card and CUDA, you most likely won’t get a trained model in an adequate time frame. Alternatively, you can rent a GPU from a cloud provider.
I took a basic three-layer model from an example repository and, surprisingly, it showed about 72% accuracy, a good result.
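One way to put the 72% in context is to compare it with the trivial strategy of always predicting the most common class. The sketch below assumes, purely for illustration, that the evaluation set has the same 62% share of "pretty" labels as the labeled dataset; under that assumption the trivial baseline scores 62%, so the model gains about ten points over it.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)
```

If the model's accuracy is not clearly above `majority_baseline`, it has learned little beyond the class proportions.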
To my surprise, I received an average of 17 matches per hour. Moreover, in several instances the woman made contact first (although this may be because I didn’t use my own picture).