Responsibilities
1. Research and develop an efficient, reliable, real-time distributed computing engine.
2. Research and develop our distributed computing platform and machine learning training job scheduling system; optimize its stability, performance, and related aspects.
3. Develop an in-depth understanding of our business model and design generalized solutions to business problems.
Qualifications
1. Proficient in Java, Python, C++, or similar programming languages and in common algorithms, with experience developing and optimizing large-scale distributed systems.
2. In-depth knowledge of and hands-on experience with open-source computing frameworks (Hadoop MapReduce, Spark, or Flink preferred).
3. In-depth knowledge of and hands-on experience with cluster resource management systems (Hadoop YARN, Mesos, or Kubernetes preferred).
Preferred
1. In-depth knowledge of and experience with machine learning training and scheduling frameworks.
2. Experience operating and managing ultra-large-scale clusters.