CSE doctoral student Zhiyu Chen will apply the specialized skills he developed at Lehigh when he joins Amazon’s Alexa Shopping team after graduation

“Data science is a tool that can satisfy my curiosity,” says Zhiyu Chen, a doctoral student in computer science and engineering. “It helps me to understand a lot of questions without being an expert in certain fields.” 

With his advisor, Professor Brian D. Davison, an expert in search engines, Chen is applying advanced machine learning techniques to better understand datasets. The project involves building a search engine specifically for datasets, which Google and other mainstream search engines aren’t equipped to find.

"When you have more hands-on experience, you can develop better intuition about data. Therefore, mastering good programming skills is very important. Those are your diagnostic tools to help you answer specific questions you can think of. Data science is a field that develops very fast, so you should always be prepared to learn new things."
Zhiyu Chen

“Traditionally, people” (perhaps a journalist or a government employee) “are searching the descriptions of datasets that have been indexed by search engines like Google,” and then tracking down the information from there, says Davison. “But sometimes the information you’re looking for is in the dataset itself, and not in the description. So we’ve been working on building better representations of the datasets so they can be found more easily.”

Chen’s focus on computer science developed out of his interest in PC gaming—and his reliance on search tools. “I was always searching for gaming tips on the Internet,” he says, “so I decided to study in CS because I was very interested in the techniques behind game development, search engines, etc.” 

But he traces his curiosity back to his early childhood, when he dreamed of having a superpower—like the ability to control the result of a coin flip.  

“When studying probabilities in primary school, the textbook often assumes that each flip of a coin has an equal chance of coming up heads or tails. But when I did the experiments, I found out that it was not an equal chance, and the result could be affected by some factors, such as the initial position of the coin and the force when you flip.”

Today, Chen sees data science as a superpower, one that allows him to model those factors and see the possibility of controlling the outcome. “I have read a book about dice control written by a professional gambler,” he says, “and from my perspective, the author is really a data scientist who demystifies dice control with data science techniques.”

After completing his PhD, Chen will join Amazon’s Alexa Shopping team as an applied scientist for the company’s popular virtual assistant. “Our mission is to build new machine learning models,” he says, “so that Alexa can better understand customers and provide better services such as search and recommendation.”

During his graduate studies, Chen interned at Amazon, where he developed a new method for conversational question answering that enables a virtual assistant like Alexa to interact with users more naturally through a complete conversation.

"The most challenging part in my field is still about obtaining enough training data," he says, "and I think it is also the biggest challenge in all AI-related fields. That's why people begin to study how to use less data to train a good AI model. The model’s efficiency is another challenging problem. Since the models are heavier than before, they require more computational resources."

“I am lucky to see how AI/data science has been applied to almost everywhere in my life,” continues Chen, “and I am sure the coverage will increase even more in the future.”

Zhiyu Chen
PhD student, computer science and engineering

What are the most important skills for a student to have or develop to be successful as a data scientist?

I think one of the most important skills is learning how to derive questions from your curiosity. Sometimes, if you ask a series of good sub-questions, the original question is already 50 percent answered. Sub-questions are usually easier to answer, and you may already have confident assumptions about them. The remaining task is then more like verifying your assumptions with data science methods. Usually, the answers to those sub-questions can tell a complete story.