All about the competition
In the Dataracing 2023 competition, you will have the opportunity to analyse a dataset on the credit portfolio of the domestic population using various data science tools. The business problem to be solved is the prediction of credit default. A loan defaults when the borrower (the debtor) is unable to repay the debt to the lending financial institution on time or at all. The anonymised data needed to create the data science solutions was provided by the co-organisers of the competition, the MNB.
The aim of the data analysis is to estimate the probability that a borrower will have at least one defaulted (non-performing) loan in the near future by examining loan data.
How far can you predict the probability of a loan default from historical data?
How does the number of people linked to the loan affect the default of the loan?
What analytical tricks can be applied to such a data set?
Which algorithms are best suited to solve the problem?
We are looking for answers to questions like these.
How can I join?
- Register on the competition platform, read and accept the competition rules.
- Download a dataset containing credit data and a sample submission.
- Come up with a solution to predict the target variable.
- Format your solution according to the sample solution.
- Submit your solution on the competition platform.
- Improve your solution and upload it until the end of the competition and take home the prize!
About the data used in the competition
The initial dataset for the competition includes data on all retail loans over a period of almost three years. The data set is anonymised and distorted. The amount of the loan, interest period, type, repayment period and much more are known. For each loan product it is known which borrowers have taken out the loan, but of course a loan can be taken out jointly by several borrowers at the same time – in which case the initial data set will contain several rows with different borrower information.
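Because a jointly held loan appears as several rows, it can help to count how many borrowers are linked to each loan before building borrower-level features. The following is a minimal sketch using only the Python standard library. The field names `loan_id` and `borrower_id` and the sample rows are invented for illustration and are not the column names of the competition dataset.

```python
from collections import defaultdict

# Hypothetical rows: one row per (loan, borrower) pair. Field names are assumptions.
rows = [
    {"loan_id": "L1", "borrower_id": "B1"},
    {"loan_id": "L1", "borrower_id": "B2"},  # L1 is held jointly by B1 and B2
    {"loan_id": "L2", "borrower_id": "B1"},
]


def borrowers_per_loan(rows):
    """Count the distinct borrowers linked to each loan."""
    linked = defaultdict(set)
    for row in rows:
        linked[row["loan_id"]].add(row["borrower_id"])
    return {loan: len(borrowers) for loan, borrowers in linked.items()}


def max_borrowers_on_any_loan(rows):
    """For each borrower, the largest number of people on any of their loans."""
    counts = borrowers_per_loan(rows)
    result = defaultdict(int)
    for row in rows:
        borrower = row["borrower_id"]
        result[borrower] = max(result[borrower], counts[row["loan_id"]])
    return dict(result)


loan_counts = borrowers_per_loan(rows)
borrower_feature = max_borrowers_on_any_loan(rows)
```

A feature of this kind is one possible way to explore how the number of people linked to a loan relates to default. Whether it is useful has to be checked on the actual data.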
Each row also shows whether the loan defaulted during the nearly three-year period, and whether an early repayment was made on it. For both default and early repayment, the date of the event is also given.
In this data science challenge, we want to know whether customers who have performing loans during the three-year period can continue to pay their monthly instalments in the following two years, or whether they will fall significantly behind on at least one loan, which will then become defaulted. The task is to predict default events over the next two years. What makes the exercise special is that no data from this period is made available to participants: the forecast has to be generated from the baseline dataset of almost three years.
Two things are provided to help participants. First, the example solution lists every client for whom an estimate is required. Second, the example solution contains a fixed probability value: the proportion of customers in the full submission dataset who had at least one loan with significantly delayed monthly instalments during the period over which the target variable is defined.
Evaluation
The winner of the competition is the one who can most accurately predict whether each borrower will have a non-performing loan in the next period. The target variable is therefore a binary value, and participants are expected to provide a value expressing the probability of a default.
The solution file should contain every client identifier that has at least one loan in the initial dataset and no defaulted loans, with the predicted probability in the interval [0,1] next to each identifier. Each solution file is evaluated by the competition platform and the result is displayed on the leaderboard, where competitors can compare themselves with other competitors. The leaderboard shows the best result of each competitor at that moment.
During the competition, participants can continuously improve their solution and submit multiple solution files. Up to 10 solutions can be uploaded per day. Be smart with the number of submissions, especially on the final days of the competition.
The solutions (.csv files) submitted to the competition will be evaluated by the competition platform. The logloss metric is used for the evaluation. The better solution in the competition is the one with the lower value.
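To make the metric concrete, here is a minimal sketch of binary logloss (mean binary cross-entropy) in plain Python. The clipping constant `eps` and the toy labels are assumptions for illustration. The platform's exact implementation details, such as how it clips predictions of exactly 0 or 1, are not stated in this description.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; lower values are better."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip so that log(0) never occurs for predictions of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Toy labels: 1 = borrower defaulted, 0 = did not (made-up values).
labels = [0, 0, 0, 1]
base_rate = sum(labels) / len(labels)

# Same fixed probability for everyone, as in the sample solution.
constant_baseline = log_loss(labels, [base_rate] * len(labels))
# Confident predictions that agree with the labels.
confident_and_right = log_loss(labels, [0.05, 0.05, 0.05, 0.9])
# Confident predictions that miss the one defaulter.
confident_and_wrong = log_loss(labels, [0.05, 0.05, 0.9, 0.05])
```

In this toy example the confident and correct predictions score below the constant baseline, while the confident and wrong ones score well above it. Logloss penalises overconfident mistakes heavily, so well-calibrated probabilities matter more than hard 0/1 guesses.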
The evaluation is done in two phases, as part of the test data is set aside for the final evaluation. The logloss value obtained on one random part of the test data is public, and the resulting values are displayed on the public leaderboard for all to see. The results on the second random part of the test data can only be seen by the tournament organiser, on the private leaderboard of the tournament. The winner of the competition is the person who achieves the best result on this private leaderboard. In other words, your ranking on the public leaderboard does not necessarily correspond to your actual ranking in the tournament. Once the competition is over, the private leaderboard will be published. Using two leaderboards also prevents competitors from inferring the target variable values of the test dataset by analysing the scores of their submissions.
The results on the public leaderboard are therefore indicative and do not necessarily reflect the actual ranking of the competitors.
Deadlines
The competition is a one-round event.
- The competition starts on 27 October 2023 01:00 (CET)
- Deadline for submission of solutions: 3 December 2023 14:00 (CET)
- Winners will be notified by 14:00 on 4 December 2023
- Presentation of the winning solutions: 4 December 2023 – 6 December 2023.
- Final announcement of results: 12 December 2023
Awards
The top three participants on the private leaderboard will receive a cash prize. The total gross prize money is HUF 6 million. The winners will receive the prize money in the following proportions:
- The first place prize is HUF 1,560,000 net (41.8% of the prize money).
- The second place prize is HUF 937,000 net (25% of the prize money).
- The third place prize is HUF 625,000 net (16.6% of the prize money).
- There will also be a special prize of HUF 625,000 net (16.6% of the prize money).
In the event of a tie during the evaluation, the solution submitted earlier is the winner.
To be eligible for a prize, the contestant must be able to reproduce the solution submitted and share the source code file used with the contest organisers. By checking these, the organisers will ensure that the entrant has not used any inadmissible tools, data or methods. Submissions for Dataracing can only be submitted by persons who have already reached the age of eighteen on October 1, 2023. Competitors can be Hungarian citizens or any private individual who has a Hungarian tax identification number, TAJ number and registered address in Hungary.
A special prize will be awarded for the best solution submitted by a student of a Hungarian higher education institution. Only full-time students of Hungarian higher education institutions are eligible for the special prize (proof of this must be provided before receiving the prize).
It is important to note that those competing for the special prize must append “_uni”, “_UNI” or “_Uni” to their “participant team” name. This will allow everyone on the leaderboard to see who is competing for the special prize.
Code of Conduct
By registering for and participating in the competition, competitors agree to be bound by the following rules and the official competition rules, which they will be made aware of at the time of registration:
- I can only register once as a participant.
- I will not share my source code or files used for the solution with other competitors during the competition. I will not share my source code or files used for the solution publicly during the competition.
- I will prepare my solution in such a way that the results can be run again and reproduced at the request of the organisers. I undertake to share the source code and files with the organisers of the competition on request.
- I agree to use only commercially available or publicly available software and programming languages for the competition.
- I agree not to share, copy or publish any data or information shared with me during the competition on any platform.
- I agree that I can only use the competition data (training and test sets) to participate in the competition and submit solutions. I agree to notify the competition organisers immediately of any possible leakage or unauthorised access to the competition data by emailing hello@dataracing.hu.
- I agree that in the competition my solutions will be evaluated separately on the private and public parts of the test set and that I will only see an approximate ranking on the public leaderboard during the competition. I agree that my results in the competition will be based on the evaluation of my submitted solutions on the non-public part of the test set.
- I agree to be informed by the organisers about matters related to the competition and about my results in it at the e-mail address provided at the time of registration.
- If I include data from external sources for the competition, I agree that this data must be publicly available data and I will clearly identify the source of this external data in my solution files.
- I agree that if I, as a winner, do not respond to the notification email within one week or do not wish to be included as a potential winner, I will not be eligible for any prize as a potential winner. In such a case, I agree that the competition organisers may award other potential winners based on the results. If, as a potential winner, I am unable to reproduce my submitted solution and/or do not provide the organisers with the requested source code by the requested deadline, I agree that I will not be eligible for a prize and the competition organisers may award other entrants.
- I agree that in case of any violation of the competition rules, doping or cheating, the competition organisers may disqualify me from the competition and, if I am a potential winner, I will not be awarded any prize.
Have a question?
Check the list below first to see if you find the answer. If not, please contact us at hello@dataracing.hu.
1. Who can enter the competition?
Submissions for Dataracing can only be submitted by persons who have already reached the age of eighteen on October 1, 2023. Competitors can be Hungarian citizens or any private individual who has a Hungarian tax number, TAJ number and registered address in Hungary.
2. I would like to take part, what should I do?
- Register on the Dataracing 2023 platform, read, and accept the Terms and Conditions.
- Download the training dataset.
- Have some ideas and work on a solution to predict the target variable.
- Format your solution according to the sample solution.
- Submit your solution.
- Improve your solution, upload updated versions and take the prize!
3. How can I submit a solution?
After registering on the Dataracing 2023 platform you can upload a solution file (.csv) by clicking Submission in the menu. Please use the sample solution file to check the formal requirements of your solution file. Depending on the language settings of your computer, numbers may be written with a decimal comma; make sure your file uses a decimal point. The separator character in the submitted .csv file must be a comma, not a semicolon.
After submitting your solution, you can view your result on the public testing dataset in the list of your solutions. Your best solution also appears on the public leaderboard.
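The formatting requirements above can be met independently of the computer's language settings by writing the file explicitly. The following is a minimal sketch using Python's standard `csv` module. The header names `id` and `prediction`, the identifiers and the six-decimal precision are assumptions; copy the actual header and identifier order from the sample solution file.

```python
import csv
import io


def submission_csv(predictions, header=("id", "prediction")):
    """Render {identifier: probability} as CSV text with ',' separators and '.' decimals."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", lineterminator="\n")
    writer.writerow(header)
    for identifier, probability in predictions.items():
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"probability out of [0, 1] for {identifier}: {probability}")
        # The 'f' format always uses a decimal point, whatever the system locale is.
        writer.writerow([identifier, f"{probability:.6f}"])
    return buffer.getvalue()


# Made-up identifiers and probabilities, in the order they should appear in the file.
text = submission_csv({"A001": 0.0123, "A002": 0.5})
```

The returned text can then be written to a .csv file with `open(path, "w", newline="")`. Spreadsheet programs may re-save the file with semicolons or decimal commas, so it is safer to check the final file in a plain text editor.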
4. What is wrong with my solution?
Your file may not meet the requirements. Please make sure its format matches the sample file. Common problems include:
- Fewer or more rows (the expected number of rows is 1,117,675, with the top row containing column names)
- More columns
- Comma instead of decimal point
- Wrong order of company identifiers
- Unknown identifiers
- Multiple occurrences of identifiers
Any of these can cause an error message. If you receive an error message even when uploading the sample file, please contact us at hello@dataracing.hu.
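Most of the problems listed above can be caught locally before uploading. The sketch below compares a submission with the sample file using only the standard library. The function name `check_submission`, the two-column layout and the toy file contents are assumptions, and the platform's own validation may differ.

```python
import csv
import io


def check_submission(sample_text, submission_text):
    """Return a list of problems found; an empty list means none were detected."""
    sample = list(csv.reader(io.StringIO(sample_text)))
    submission = list(csv.reader(io.StringIO(submission_text)))
    if not sample:
        raise ValueError("the sample file is empty")
    problems = []

    if len(submission) != len(sample):
        problems.append(f"expected {len(sample)} rows, found {len(submission)}")
    if any(len(row) != len(sample[0]) for row in submission):
        problems.append("wrong number of columns (a decimal comma also causes this)")
    if submission and submission[0] != sample[0]:
        problems.append("header row differs from the sample")

    sample_ids = [row[0] for row in sample[1:]]
    ids = [row[0] for row in submission[1:] if row]
    if len(set(ids)) != len(ids):
        problems.append("duplicate identifiers")
    if set(ids) - set(sample_ids):
        problems.append("unknown identifiers")
    if ids != sample_ids and sorted(ids) == sorted(sample_ids):
        problems.append("identifiers are in a different order than in the sample")

    # Report only the first bad value to keep the output short.
    for row in submission[1:]:
        try:
            value = float(row[-1])
        except (ValueError, IndexError):
            problems.append(f"prediction is not a number in row {row!r}")
            break
        if not 0.0 <= value <= 1.0:
            problems.append(f"prediction outside [0, 1] in row {row!r}")
            break
    return problems


# Toy files with made-up identifiers.
sample = "id,prediction\nA001,0.03\nA002,0.03\n"
good = "id,prediction\nA001,0.10\nA002,0.90\n"
bad = "id,prediction\nA002,0,9\nA002,0.10\n"  # decimal comma and a repeated identifier

good_problems = check_submission(sample, good)
bad_problems = check_submission(sample, bad)
```

Here `good_problems` is empty, while `bad_problems` reports the extra column created by the decimal comma, the duplicated identifier and the out-of-range value.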
5. Can we start in a team?
No, you can only compete individually.
6. What technologies can I use to prepare my solution?
You can use any publicly available technology that provides a solution that can be reproduced. To be awarded a prize, you need to be able to reproduce your solution and to share the source code and files with the organisers. Your solutions can be prepared using any language (e.g., Python, R) and any public libraries. You can also use publicly and commercially available data analytics software applications. If you have doubts about the technology you are using, please contact us at hello@dataracing.hu.
7. What does it mean that my solution should be reproducible?
To be awarded a prize, you need to be able to reproduce your solution and to share the source code and files with the organisers. A reproducible solution can be rerun and produces exactly the same results on the private and public leaderboards as before. We therefore ask you to always fix any random parameters you use. One example is the random_state parameter of machine learning models in Python's sklearn package.
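The same idea applies to every step that involves randomness, not only to model training. As a minimal sketch, the split below uses a seeded generator from Python's standard `random` module so that rerunning it gives the same result. The seed value, the function name `train_validation_split` and the identifiers are illustrative assumptions.

```python
import random

SEED = 2023  # arbitrary fixed value; any integer works as long as it stays the same


def train_validation_split(identifiers, validation_fraction=0.2, seed=SEED):
    """Shuffle with a private, seeded generator and split into (train, validation)."""
    rng = random.Random(seed)  # not affected by other code that uses `random`
    shuffled = list(identifiers)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * validation_fraction)
    return shuffled[cut:], shuffled[:cut]


# Made-up identifiers.
ids = [f"client_{i}" for i in range(10)]
first_run = train_validation_split(ids)
second_run = train_validation_split(ids)  # identical to first_run
```

Libraries with their own random number generators need their own seed. With sklearn, that means passing the same fixed integer as random_state wherever the parameter is available.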
8. What other rules apply to the competition?
You can read the detailed competition rules on the following page. By entering the competition you agree to these rules.