What is the Alchemy Dataset?

The Tencent Quantum Lab has recently introduced a new molecular dataset, called Alchemy, to facilitate the development of new machine learning models useful for chemistry and materials science.

The dataset lists 12 quantum mechanical properties of 130,000+ organic molecules comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the GDBMedChem database. These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework (PySCF).

The Alchemy dataset expands on the volume and diversity of existing molecular datasets such as QM9.

For more details on Alchemy, please refer to the paper Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models. If you use the dataset in your research, please cite the paper below:

@article{chen2019alchemy,
  title={Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models},
  author={Chen, Guangyong and Chen, Pengfei and Hsieh, Chang-Yu and Lee, Chee-Kong and Liao, Benben and Liao, Renjie and Liu, Weiwen and Qiu, Jiezhong and Sun, Qiming and Tang, Jie and Zemel, Richard and Zhang, Shengyu},
  journal={arXiv preprint arXiv:1906.09427},
  year={2019}
}

 

Join the Alchemy Contest

Take part and help develop machine learning models that accurately predict the properties of organic molecules!

In this multi-feature learning contest, you are free to use whatever method you like to predict a set of 12 properties for organic molecules. To help train your model, a training set with the same set of molecular properties is provided below.

The competition will be conducted in two phases.

Phase 1 (Development)

5/22/2019 - 8/7/2019

A period for contest participants to get familiar with Codalab and develop their models. Participants are asked to predict properties for the molecules given in the valid.zip file. Five submissions per day are allowed.

Phase 2 (Evaluation)

8/8/2019 - 10/7/2019

The final stage of the competition. Participants are asked to predict properties for the molecules given in the test.zip file. One submission per day is allowed, up to a total of twenty submissions in Phase 2.

The contest evaluation is based on the mean absolute error averaged over 12 regression tasks.
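The official scoring script is not reproduced on this page, so the following is only a minimal sketch of one reading of the metric: compute the mean absolute error for each property separately, then average those values across tasks. The function name `mean_mae` and the toy numbers are illustrative assumptions, not part of the contest tooling.

```python
def mean_mae(y_true, y_pred):
    """Average of per-task mean absolute errors.

    y_true, y_pred: lists of rows, one row per molecule and one
    column per regression task (12 columns in the contest).
    Returns (overall_score, per_task_mae_list).
    """
    n_rows = len(y_true)
    n_tasks = len(y_true[0])
    per_task = []
    for t in range(n_tasks):
        # Sum of absolute errors for task t over all molecules.
        abs_err = sum(abs(y_true[i][t] - y_pred[i][t]) for i in range(n_rows))
        per_task.append(abs_err / n_rows)
    return sum(per_task) / n_tasks, per_task


# Toy data: 2 molecules, 3 tasks (hypothetical values).
truth = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0]]
preds = [[0.5, 1.0, 1.0], [1.0, 0.0, 0.0]]
score, per_task = mean_mae(truth, preds)
```

Tracking the per-task values alongside the overall score can help show which properties your model struggles with.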

Rewards

 


Cash prizes totaling ¥100,000 (RMB) will be awarded to the top three entries on the Phase 2 leaderboard only.

  • First Place Prize ¥50,000
  • Second Place Prize ¥30,000
  • Third Place Prize ¥20,000

Requirements

Winners of the first, second, and third place prizes must provide clear model documentation and code, as required by the Declaration of Eligibility, Non-Exclusive License, and Release form. The form will be distributed in Phase 2.

 

Contest Rules

Please refer to Contest Rules for full details. Every contest participant must acknowledge having read the contest rules before obtaining the datasets.

 

Contest Data

Training and Validation

Please download the dev.zip (training), valid.zip (validation in Phase 1) and test.zip (evaluation in Phase 2) files.

For development (Phase 1, 5/22 - 8/7/2019)

The dev.zip contains 99,776 SD files, each giving the structural information of one molecule, and a train.csv file giving the 12 properties of every molecule.
dev.zip   (updated 7/30)
md5sum: 70086cc2a2ac07f36a3a7c11a305a1a3

The SD filenames correspond to the molecular identification numbers found in the GDBMedChem database. These identification numbers are also used to distinguish molecules in the train.csv file. The molecular files are stored in different directories based on the number of heavy atoms.

The valid.zip contains 3,951 SD files. This dataset is to be used for the Phase 1 competition. It will be available for download on 5/22/2019.
valid.zip   (updated 7/30)
md5sum: dbe50df5f0b8a2771ed0f6f31481c035

For evaluation (Phase 2, 8/8 - 10/7/2019)

The test.zip contains 15,760 SD files. This dataset is to be used for the Phase 2 competition. It will be available for download on 8/8/2019.
test.zip   (updated 7/30)
md5sum: e6b6f17882137118e2c323a77e793305
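
If the md5sum command-line tool is not available on your system, the checksums listed above can also be checked from Python. The sketch below uses only the standard library; the helper names `md5_of_file` and `verify` are illustrative, and the file paths assume the archives sit in your working directory.

```python
import hashlib

# Checksums as listed on this page.
EXPECTED_MD5 = {
    "dev.zip": "70086cc2a2ac07f36a3a7c11a305a1a3",
    "valid.zip": "dbe50df5f0b8a2771ed0f6f31481c035",
    "test.zip": "e6b6f17882137118e2c323a77e793305",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify("dev.zip", EXPECTED_MD5["dev.zip"])` should return True for an intact download. A mismatch usually means the download was truncated or an older version of the archive is being used.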

Rules

All the molecular properties are retrieved from Tencent Quantum Lab's Alchemy dataset. In this contest, each reported molecular property is normalized by subtracting its population mean and dividing by its standard deviation.
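
As a rough illustration of this normalization, the sketch below standardizes one property column and maps the values back. The means and standard deviations actually used for the contest are not given on this page, so the numbers here are hypothetical. Using the population standard deviation (rather than the sample one) is also an assumption.

```python
import statistics


def standardize(values):
    """Return (z_scores, mean, std) for one property column."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values], mean, std


def destandardize(z_scores, mean, std):
    """Invert standardize(): map z-scores back to the original scale."""
    return [z * std + mean for z in z_scores]


# Hypothetical raw values for a single property.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z, mu, sigma = standardize(raw)
restored = destandardize(z, mu, sigma)
```

Because every property is on the same unit-free scale after this step, errors on different properties can be averaged without one property dominating the score.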

 

Optional Tools

RDKit

For contest participants without prior experience in handling molecular data, we strongly recommend learning to work with RDKit, a cheminformatics toolkit that makes it easy to build molecular graphs from the SD files we provide.

RDKit - Getting Started in Python
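
To give a feel for what building a molecular graph from an SD file involves, here is a dependency-free sketch that reads the atom and bond tables of a single mol block. It is not RDKit's API and not a replacement for it. It assumes the files use the common fixed-width V2000 connection-table layout, and the formaldehyde record and the name `parse_molblock` are made up for illustration.

```python
def parse_molblock(molblock):
    """Parse a V2000 mol block into (symbols, coords, bonds).

    bonds is a list of (atom_index_a, atom_index_b, bond_order)
    with 0-based atom indices.
    """
    lines = molblock.splitlines()
    counts = lines[3]  # three header lines, then the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    symbols, coords = [], []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character fields; the symbol follows.
        coords.append((float(line[0:10]), float(line[10:20]), float(line[20:30])))
        symbols.append(line[31:34].strip())
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        bonds.append((a - 1, b - 1, order))  # file indices are 1-based
    return symbols, coords, bonds


# Hand-written formaldehyde record (hypothetical, not taken from the dataset).
EXAMPLE = "\n".join([
    "formaldehyde",
    "  hand-written example",
    "",
    "  4  3  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.5500    0.9500    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.5500   -0.9500    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "  1  3  1  0",
    "  1  4  1  0",
    "M  END",
    "$$$$",
])

symbols, coords, bonds = parse_molblock(EXAMPLE)

# Undirected adjacency list: the graph most models consume.
adjacency = {i: [] for i in range(len(symbols))}
for a, b, _order in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

n_heavy = sum(1 for s in symbols if s != "H")
```

In practice RDKit takes care of this parsing, along with aromaticity, charges and many other details that this sketch ignores.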

Tencent Alchemy Tools

If you do not want to dive into RDKit, we also provide a ready-to-use PyTorch data loader that can help you work with these molecules.

You may also find a collection of baselines, including MPNN, from which you can start your journey with Alchemy!

Tencent Alchemy Tools

 

Submission

answer.csv

The following description applies to both Phase 1 and Phase 2.

Once you have built a model that works to your satisfaction, run it against the molecules provided in either the valid.zip or test.zip file, and save the predicted properties in a file named answer.csv, following the format of the train.csv file in dev.zip.

In short, answer.csv should store an N by 13 matrix, where N is the number of molecules in the valid.zip file during Phase 1 or in the test.zip file during Phase 2. The first column should be the GDB ID, followed by 12 columns of molecular properties. The rows should be sorted in ascending order of GDB ID.

The answer.csv file should then be zipped and named submission.zip before uploading for evaluation.
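
The packaging steps above can be sketched with the Python standard library as follows. The function name `write_submission` and the shape of the `predictions` input are illustrative assumptions. Whether a header row is expected is not stated on this page, so the sketch writes one only if you pass it in (for instance, the header row copied from train.csv).

```python
import csv
import zipfile


def write_submission(predictions, csv_path="answer.csv",
                     zip_path="submission.zip", header=None):
    """Write answer.csv sorted by GDB ID and zip it as submission.zip.

    predictions: dict mapping GDB ID (int) -> list of 12 predicted values.
    header: optional list of 13 column names (an assumption; match train.csv).
    Returns the rows that were written.
    """
    rows = []
    for gdb_id in sorted(predictions):  # ascending GDB ID
        props = predictions[gdb_id]
        if len(props) != 12:
            raise ValueError(f"molecule {gdb_id}: expected 12 properties")
        rows.append([gdb_id, *props])
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the file under the required name, without any directories.
        zf.write(csv_path, arcname="answer.csv")
    return rows
```

Before uploading, it is worth checking that the number of rows matches the number of SD files in the archive for the current phase.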

Join the Alchemy Contest  

* A Codalab account is required for your submission.

Available on
5/22/2019 - 8/7/2019 Phase 1 (Development),
8/8/2019 - 10/7/2019 Phase 2 (Evaluation)

 

Have Questions?

For general questions, please ask on the Codalab forum for the Alchemy Contest.

Problems with the datasets or data loaders? Submit an issue on the Alchemy repository on GitHub.

 

Note: Tencent reserves the right to adjust the competition rules, prize information, schedule, and other aspects of the contest, including the relevant requirements and specifications, according to how the competition is operating. All other content involved in the contest is subject to final confirmation by Tencent.


Organizers and Sponsors

Organizer

Professor Shengyu Zhang (Director of Tencent Quantum Lab)


Shengyu Zhang, Distinguished Scientist in Tencent; Associate professor, Department of Computer Science and Engineering (CSE) at The Chinese University of Hong Kong (CUHK)

Shengyu Zhang obtained his B.S. in mathematics from Fudan University in 1999, his M.S. in computer science from Tsinghua University in 2002 (under the supervision of Prof. Mingsheng Ying), and his Ph.D. in computer science from Princeton University in 2006 (under the supervision of Prof. Andrew Chi-Chih Yao).

After working at NEC Laboratories America as a summer intern, he moved to the California Institute of Technology for a two-year postdoc under the supervision of Prof. Alexei Kitaev, Prof. John Preskill, and Prof. Leonard Schulman.

Sponsor

Professor Richard Zemel (Research Director of Vector Institute)


Richard Zemel is a Professor of Computer Science at the University of Toronto, where he has been a faculty member since 2000. Prior to that, he was an Assistant Professor in Computer Science and Psychology at the University of Arizona and a Postdoctoral Fellow at the Salk Institute and at Carnegie Mellon University. He received a BSc degree in History & Science from Harvard University in 1984 and a PhD in Computer Science from the University of Toronto in 1993. He is also the co-founder of SmartFinance, a financial technology start-up specializing in data enrichment and natural language processing.

His research contributions include foundational work on systems that learn useful representations of data without any supervision; methods for learning to rank and recommend items; and machine learning systems for automatic captioning and answering questions about images. He developed the Toronto Paper Matching System, a system for matching paper submissions to reviewers, which is used by many conferences, including NIPS, ICML, CVPR, ICCV, and UAI. His research is supported by grants from NSERC, CIFAR, Microsoft, Google, Samsung, DARPA and IARPA.

His awards include an NVIDIA Pioneers of AI Award, a Young Investigator Award from the Office of Naval Research, a Presidential Scholar Award, and two NSERC Discovery Accelerators. Rich is a Fellow of the Canadian Institute for Advanced Research and is on the Executive Board of the Neural Information Processing Society, which runs the premier international machine learning conference.

Sponsor

Professor Jie Tang (Tsinghua University, Beijing)


Jie Tang is a Full Professor and the Vice Chair of the Department of Computer Science and Technology at Tsinghua University. His interests include data mining, social networks, knowledge graphs, machine learning, and artificial intelligence. He has been a visiting scholar at Cornell University, Hong Kong University of Science and Technology, and Southampton University. He has published more than 300 journal and conference papers and holds 20 patents. His papers have been cited more than 12,000 times.

He has served as PC Co-Chair of CIKM’16 and WSDM’15, Associate General Chair of KDD’18, Acting Editor-in-Chief of ACM TKDD, and Editor of IEEE TKDE/TBD and ACM TIST.

He leads the project AMiner.org for academic social network analysis and mining, which has attracted more than 10 million independent IP accesses from 220 countries/regions in the world.

He was honored with the UK Royal Society-Newton Advanced Fellowship Award, CCF Young Scientist Award, NSFC for Distinguished Young Scholar, and KDD’18 Service Award.