Yuanzhe Cai is seeking a full-time software engineer position. He has a Ph.D. in computer science from UT Arlington and experience developing database, big data, and social network analysis projects. His research focused on inferring answer quality and expertise in question/answer communities. He has strong skills in Java, databases, and data mining tools.
From Text to Data to the World: The Future of Knowledge Graphs (Paul Groth)
Keynote Integrative Bioinformatics 2018
https://docs.google.com/document/d/1E7D4_CS0vlldEcEuknXjEnSBZSZCJvbI5w1FdFh-gG4/edit
Can we improve research productivity through providing answers stemming from knowledge graphs? In this presentation, I discuss different ways of building and combining knowledge graphs.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought on how to manage knowledge, ranging from collection development to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering is targeted at helping answer a human's information need.
However, demand is increasingly for data: data needed not for people's consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers, professionals who prepare data for machine consumption. In this talk, I give an overview of the information needs of machine intelligence and ask the question: are our knowledge management techniques applicable for serving this new consumer?
Thoughts on Knowledge Graphs & Deeper Provenance (Paul Groth)
Thoughts on the need for deeper provenance for knowledge graphs, but also on using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
The Challenge of Deeper Knowledge Graphs for Science (Paul Groth)
Over the past 5 years, we have seen multiple successes in the development of knowledge graphs for supporting science in domains ranging from drug discovery to social science. However, in order to really improve scientific productivity, we need to expand and deepen our knowledge graphs. To do so, I believe we need to address two critical challenges: 1) dealing with low-resource domains; and 2) improving quality. In this talk, I describe these challenges in detail and discuss some efforts to overcome them through the application of techniques such as unsupervised learning, the use of non-experts in expert domains, and the integration of action-oriented knowledge (i.e., experiments) into knowledge graphs.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering & Science are selected through rigorous peer review to ensure originality, timeliness, relevance, and readability.
Theoretical work submitted to the Journal should be original in its motivation or modeling structure. Empirical analysis should be based on a theoretical framework and should be capable of replication. It is expected that all materials required for replication (including computer programs and data sets) should be available upon request to the authors.
How to best manage your data to make the most of it for your research, with the ODAM Framework (Open Data for Access and Mining): give open access to your data and make it ready to be mined.
These slides are from a presentation at the 2009 IEEE/WIC/ACM International Conference on Web Intelligence. The paper concentrates on providing more personalized search support for a specific user by considering his/her historical or recent interests. Models inspired by cognitive memory retention are proposed and implemented in this system. Other supporting functionalities, such as domain distribution support, are briefly mentioned. The full paper can be downloaded from http://www.iwici.org/~yizeng/papers/WI2009-camera-ready.pdf
Next-Generation Search Engines for Information Retrieval (Waqas Tariq)
In recent years, there have been significant advancements in scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. Scientific data is generally rich, not easy to understand, and spread across different places. In order to integrate these pieces together, a data archive and associated metadata should be generated. The data should be stored in a format that is locatable, retrievable, and understandable; more importantly, it should be in a form that will continue to be accessible as technology changes, such as XML. New search technologies are being implemented around these protocols, which makes searching easy, fast, and robust. One such system is Mercury, a metadata harvesting, data discovery, and access system built for researchers to search, share, and obtain spatiotemporal data used across a range of climate and ecological sciences.
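The value of keeping metadata in a durable, self-describing format such as XML can be illustrated with a minimal sketch. The catalog and its record fields below are hypothetical, not Mercury's actual schema; only the Python standard library is used:

```python
import xml.etree.ElementTree as ET

# A made-up metadata catalog with keyword and spatial-extent fields.
CATALOG = """<catalog>
  <record>
    <title>Net primary productivity, boreal forest</title>
    <keyword>carbon</keyword>
    <north>68.0</north><south>50.0</south>
  </record>
  <record>
    <title>Soil respiration, tropical sites</title>
    <keyword>carbon</keyword>
    <north>10.0</north><south>-10.0</south>
  </record>
</catalog>"""

def search(xml_text, keyword, min_lat):
    """Return titles of records matching a keyword whose extent reaches min_lat."""
    root = ET.fromstring(xml_text)
    hits = []
    for rec in root.iter("record"):
        kw = rec.findtext("keyword", "")
        north = float(rec.findtext("north", "-90"))
        if keyword in kw and north >= min_lat:
            hits.append(rec.findtext("title"))
    return hits

boreal = search(CATALOG, "carbon", 45.0)   # only the high-latitude record
```

Because the records are plain XML, the same catalog stays parseable by any future tooling, which is the archival point the abstract makes.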
Introduction to Big Data and its Potential for Dementia Research (David De Roure)
Presentation at Dementia Conference (Evington Initiative) held at Wellcome Trust, 22-23 October 2012. Acknowledgements to McKinsey & Company, also Tim Clark (MGH) and Iain Buchan (University of Manchester), for input to slides.
Prov-O-Viz is a visualisation service for provenance graphs expressed using the W3C PROV vocabulary. It uses the Sankey-style visualisation from D3.js.
See http://provoviz.org
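As a rough sketch of the kind of transformation involved (not the actual Prov-O-Viz code), PROV usage and generation relations can be flattened into the nodes/links structure that D3's Sankey layout consumes. The activity and entity names below are made up:

```python
def prov_to_sankey(used, generated):
    """Turn PROV-style relations into Sankey input.

    used: list of (activity, entity) pairs  -- the activity used the entity
    generated: list of (entity, activity) pairs -- the entity was generated by it
    """
    names = []
    def idx(name):                      # assign each node a stable index
        if name not in names:
            names.append(name)
        return names.index(name)
    links = []
    for activity, entity in used:       # entity flows into the activity
        links.append({"source": idx(entity), "target": idx(activity), "value": 1})
    for entity, activity in generated:  # activity flows into the entity
        links.append({"source": idx(activity), "target": idx(entity), "value": 1})
    return {"nodes": [{"name": n} for n in names], "links": links}

# Toy provenance: a "clean" activity used raw.csv and generated clean.csv.
g = prov_to_sankey(used=[("clean", "raw.csv")],
                   generated=[("clean.csv", "clean")])
```

The resulting dict can be serialized to JSON and handed to a Sankey renderer directly.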
Application of Hidden Markov Model in Question Answering Systems (ijcsa)
With the increase in the volume of information saved on the web, Question Answering (QA) systems have become very important for Information Retrieval (IR). QA systems are a specialized form of information retrieval: given a collection of documents, a QA system attempts to retrieve correct answers to questions posed in natural language. A web QA system is a QA system that retrieves answers from the web environment. In contrast to databases, the information saved on the web does not follow a distinct structure and is not generally well defined. Web QA is the task of automatically answering a question posed in natural language, and it relies on Natural Language Processing (NLP). NLP techniques are used in applications that make queries to databases, extract information from text, retrieve relevant documents from a collection, translate from one language to another, generate text responses, or recognize spoken words and convert them into text. To find the needed information in a mass of unstructured information, we have to use techniques in which the accuracy and retrieval factors are implemented well. In this paper, in order to support effective IR in the web environment, a QA system is designed and implemented based on the Hidden Markov Model (HMM).
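The decoding step of an HMM can be sketched with the standard Viterbi algorithm. The toy states and probabilities below are illustrative only, not the paper's model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for an observation sequence."""
    # V[t][s] = (best probability of a path ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    state = max(V[-1], key=lambda s: V[-1][s][0])   # best final state
    path = [state]
    for t in range(len(obs) - 1, 0, -1):            # follow back-pointers
        state = V[t][state][1]
        path.append(state)
    return path[::-1]

# Toy tagger: is each word part of the question phrase (QW) or not (OTHER)?
states = ["QW", "OTHER"]
start_p = {"QW": 0.5, "OTHER": 0.5}
trans_p = {"QW": {"QW": 0.2, "OTHER": 0.8}, "OTHER": {"QW": 0.3, "OTHER": 0.7}}
emit_p = {"QW": {"what": 0.8, "is": 0.1, "hmm": 0.1},
          "OTHER": {"what": 0.1, "is": 0.5, "hmm": 0.4}}
tags = viterbi(("what", "is", "hmm"), states, start_p, trans_p, emit_p)
```

In a QA setting the hidden states would be answer-relevant tags and the observations the words of the question or candidate answer.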
TUW-ASE Summer 2015 - Quality of Result-aware Data Analytics (Hong-Linh Truong)
This is a lecture from the advanced service engineering course from the Vienna University of Technology. See http://dsg.tuwien.ac.at/teaching/courses/ase
Top Cited Computer Science and Engineering Survey Research Articles from 2016... (IJCSES Journal)
The AIRCC's International Journal of Computer Science and Engineering Survey (IJCSES) is devoted to fields of Computer Science and Engineering surveys, tutorials and overviews. The IJCSES is a peer-reviewed scientific journal published in electronic form as well as print form. The journal will publish research surveys, tutorials and expository overviews in computer science and engineering. Articles from supplementary fields are welcome, as long as they are relevant to computer science and engineering.
J48 and JRIP Rules for E-Governance Data (CSCJournals)
Data are any facts, numbers, or text that can be processed by a computer. Data mining is an analytic process designed to explore data, usually large amounts of data, and is often considered to be a blend of statistics. In this paper we use two data mining techniques for discovering classification rules and generating a decision tree: J48 and JRIP. The data mining tool WEKA is used in this paper.
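The core of J48 (WEKA's implementation of C4.5) is choosing decision-tree splits by information gain. A minimal pure-Python sketch of that criterion, on a made-up two-attribute dataset (WEKA itself is not required):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting rows on attribute index attr."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in by_value.values())
    return entropy(labels) - remainder

# Hypothetical records: attribute 0 predicts the class, attribute 1 does not.
rows = [("a", "x"), ("a", "x"), ("b", "x"), ("b", "y")]
labels = [1, 1, 0, 0]
best = max(range(2), key=lambda a: information_gain(rows, labels, a))
```

C4.5 applies this selection recursively to grow the tree; JRIP (RIPPER) instead grows and prunes rules directly, but uses a similar gain-based heuristic.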
Study on Potential Capabilities of a NoDB System (ijitjournal)
There is a need for optimal data-to-query processing techniques to handle increasing database size, complexity, and diversity of use. With the introduction of commercial websites and social networks, expectations are that highly scalable, more flexible databases will replace the RDBMS. Complex applications and BigTable require highly optimized queries, and users face increasing bottlenecks in their data analysis. A growing part of the database community recognizes the need for significant and fundamental changes to database design. A new philosophy for creating database systems, called NoDB, aims at minimizing the data-to-query time, most prominently by removing the need to load data before launching queries: queries are processed without any data preparation or loading steps. There may be no need to store data at all; users can pipe raw data from websites, databases, and Excel sheets without storing anything. This study is based on PostgreSQL systems. A series of baseline experiments was executed to evaluate the performance of this system with respect to: a) data loading cost, b) query processing time, c) avoidance of collision and deadlock, d) enabling big data storage, and e) optimized query processing. The study found significant potential capabilities of the NoDB system over traditional database management systems.
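A minimal sketch of the NoDB idea: query a raw CSV directly, lazily building a positional map of row offsets so the first query pays the parsing cost and later queries skip re-scanning. Toy data; the real NoDB prototype works inside PostgreSQL:

```python
RAW = "id,city,temp\n1,Oslo,4\n2,Lima,19\n3,Pune,31\n"

class InSituTable:
    """Query a raw CSV in place, NoDB-style: no load step, lazy positional map."""
    def __init__(self, text):
        self.text = text
        self.offsets = None   # byte offsets of data rows, filled on first scan
    def scan(self):
        if self.offsets is None:           # first query pays the tokenizing cost
            self.offsets = []
            pos = self.text.index("\n") + 1
            while pos < len(self.text):
                self.offsets.append(pos)
                pos = self.text.index("\n", pos) + 1
        header = self.text[:self.text.index("\n")].split(",")
        for off in self.offsets:           # later queries reuse the map
            line = self.text[off:self.text.index("\n", off)]
            yield dict(zip(header, line.split(",")))

t = InSituTable(RAW)
hot = [r["city"] for r in t.scan() if int(r["temp"]) > 15]
```

The positional map (`offsets`) is the key NoDB structure: it amortizes raw-file parsing across queries without ever materializing the data in a database.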
Orchestrating Big Data Analysis Workflows in the Cloud.docx (fredharris32)
Orchestrating Big Data Analysis Workflows in the Cloud:
Research Challenges, Survey, and Future Directions
MUTAZ BARIKA and SAURABH GARG, University of Tasmania
ALBERT Y. ZOMAYA, University of Sydney
LIZHE WANG, China University of Geoscience (Wuhan)
AAD VAN MOORSEL, Newcastle University
RAJIV RANJAN, China University of Geoscience (Wuhan) and Newcastle University
Interest in processing big data has increased rapidly to gain insights that can transform businesses, government policies, and research outcomes. This has led to advancement in communication, programming, and processing technologies, including cloud computing services and technologies such as Hadoop, Spark, and Storm. This trend also affects the needs of analytical applications, which are no longer monolithic but composed of several individual analytical steps running in the form of a workflow. These big data workflows are vastly different in nature from traditional workflows. Researchers are currently facing the challenge of how to orchestrate and manage the execution of such workflows. In this article, we discuss in detail orchestration requirements of these workflows as well as the challenges in achieving these requirements. We also survey current trends and research that supports orchestration of big data workflows and identify open research challenges to guide future developments in this area.
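A big data workflow of the kind described is a DAG of analytical steps. A minimal orchestration sketch (the three-step pipeline is hypothetical; a real orchestrator would dispatch each step to an engine such as Spark or Storm):

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical analytical workflow: each step lists the steps it depends on.
steps = {
    "ingest": [],
    "clean": ["ingest"],
    "report": ["clean", "ingest"],
}

def run(workflow, actions):
    """Execute workflow steps in dependency order; return the execution trace."""
    trace = []
    for step in TopologicalSorter(workflow).static_order():
        actions[step]()          # here: no-ops; in practice, cluster jobs
        trace.append(step)
    return trace

order = run(steps, {s: (lambda: None) for s in steps})
```

Topological ordering is only the simplest orchestration concern; the survey's challenges (elasticity, fault tolerance, data movement) sit on top of this basic execution model.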
CCS Concepts: • General and reference → Surveys and overviews; • Computer systems organization → Cloud computing; • Computing methodologies → Distributed algorithms; • Computer systems organization → Real-time systems
Additional Key Words and Phrases: Big data, cloud computing, workflow orchestration, research taxonomy, approaches, and techniques
ACM Reference format:
Mutaz Barika, Saurabh Garg, Albert Y. Zomaya, Lizhe Wang, Aad van Moorsel, and Rajiv Ranjan. 2019. Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions. ACM Comput. Surv. 52, 5, Article 95 (September 2019), 41 pages. https://doi.org/10.1145/3332301
This research is supported by an Australian Government Research Training Program (RTP) Scholarship. This research is also partially funded by two Natural Environment Research Council (UK) projects (LANDSLIP: NE/P000681/1 and FloodPrep: NE/P017134/1).
Authors' addresses: M. Barika and S. Garg, Discipline of ICT, School of Technology, Environments and Design (TED), College of Sciences and Engineering, University of Tasmania, Hobart, Tasmania, Australia 7001; emails: {mutaz.barika, saurabh.garg}@utas.edu.au; A. Zomaya, School of Computer Science, Faculty of Engineering, J12 - Computer Science Building, The University of Sydney, Sydney, New South Wales, Australia; email: [email protected]; L. Wang, School of Computer Science, China University of Geoscience, No. 388 Lumo Road, Wuhan, P. R. China; email: [email protected]; A. van Moorsel, School of Computing, Newcastle University.
Stacked Generalization of Random Forest and Decision Tree Techniques for Libr... (IJEACS)
The huge amount of library data stored in our modern research and statistics centers grows on a daily basis. These databases grow exponentially in size over time, and it becomes exceptionally difficult to understand the behavior of the data and interpret the relationships that exist between attributes. This exponential growth poses new organizational challenges: the conventional record management infrastructure can no longer cope to give precise and detailed information about the behavior of data over time. There is confusion and novel concern in selecting tools that can support multi-dimensional big data visualization. Viewing all related data in a database at once is a problem that has attracted the interest of data professionals with machine learning skills, and a lingering issue in the data industry, because existing techniques cannot be used to remove or filter noise from relevant data and pad missing values in order to get the required information. The aim is to develop a stacked generalization model that combines the functionality of random forest and decision tree techniques to visualize a library database. In this paper, the random forest and decision tree techniques were employed to effectively visualize large amounts of school library data. The proposed system was implemented with a few lines of Python code to create visualizations that help users understand and interpret, at a glance, the behavior of data and its relationships. The model was trained and tested to learn and extract hidden patterns in the data with a cross-validation test. It combined the functionalities of both models into a stacked generalization model that performed better than the individual techniques.
The stacked model produced 95% accuracy, followed by the RF, which produced a 95% accuracy rate and a 0.223600 RMSE value, in comparison with the DT, which recorded an 80.00% success rate and a 0.15990 RMSE value.
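The stacking idea, base learners whose predictions feed a trained level-1 meta-learner, can be sketched in a few lines of pure Python. One-rule threshold stumps stand in here for the paper's random forest and decision tree, and the data is invented:

```python
from collections import Counter

def stump(threshold, feature):
    """One-rule base learner: predict 1 when row[feature] > threshold."""
    return lambda row: int(row[feature] > threshold)

def fit_meta(base_models, rows, labels):
    """Level-1 learner: map each tuple of base predictions to its majority label."""
    votes = {}
    for row, y in zip(rows, labels):
        key = tuple(m(row) for m in base_models)
        votes.setdefault(key, []).append(y)
    table = {k: Counter(ys).most_common(1)[0][0] for k, ys in votes.items()}
    return lambda row: table.get(tuple(m(row) for m in base_models), 0)

# Toy training data: class 1 has a large first feature, class 0 a large second.
rows = [(1, 9), (2, 8), (8, 2), (9, 1)]
labels = [0, 0, 1, 1]
stacked = fit_meta([stump(5, 0), stump(5, 1)], rows, labels)
```

The essential point stacking makes, and that this sketch preserves, is that the meta-level model is *learned* from the base models' outputs rather than being a fixed voting rule.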
Top Cited Articles 2020 - Advanced Computational Intelligence: An Internation... (aciijournal)
Advanced Computational Intelligence: An International Journal (ACII) is a quarterly open access peer-reviewed journal that publishes articles which contribute new results in all areas of computational intelligence. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced computational intelligence concepts and establishing new collaborations in these areas.
Classifier Model using Artificial Neural Network (AI Publications)
When it comes to AI and ML, precision in categorization is of the utmost importance. In this research, the use of supervised instance selection (SIS) to improve the performance of artificial neural networks (ANNs) in classification is investigated. The goal of SIS is to enhance the accuracy of future classification tasks by identifying and selecting a subset of examples from the original dataset. The purpose of this research is to shed light on how useful SIS is as a preprocessing tool for ANN-based classification. The work aims to improve the input dataset to ANNs by using SIS, which may help with problems caused by noisy or redundant data. The ultimate goal is to improve ANNs' ability to classify data points properly across a wide range of application areas.
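One common form of supervised instance selection is editing: drop training instances whose nearest neighbor carries a different label, so the remaining set is cleaner for a downstream classifier such as an ANN. A minimal sketch on invented data (the paper's exact SIS procedure may differ):

```python
def edited_selection(points, labels):
    """Keep instances whose nearest other instance agrees on the label (ENN-style)."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(points[i], points[j])))
    keep = [i for i in range(len(points)) if labels[nearest(i)] == labels[i]]
    return [points[i] for i in keep], [labels[i] for i in keep]

# Two tight clusters plus one mislabeled point near cluster 0.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.5, 0.5)]
ys  = [0, 0, 1, 1, 1]          # the last label disagrees with its neighborhood
clean_pts, clean_ys = edited_selection(pts, ys)
```

The selected subset would then be what is actually fed to the ANN's training loop.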
Yuanzhe Cai's Curriculum Vitae, The University of Texas at Arlington
CURRICULUM VITAE
General Information
Name: Dr. Yuanzhe Cai Gender: Male Age: 32
Address: D209 Via Lucca, Irvine, CA 92612
Email: yuanzhe.cai@gmail.com Mobile Telephone: (682) 240-5640
Objectives
Seeking a full-time software engineer position
Education
The University of Texas at Arlington (Texas) 2009-2014 (GPA: 3.78/4.0)
Ph.D. in Computer Science and Engineering Supervisor: Prof. Sharma Chakravarthy
Dissertation: Inferring answer quality, answerer expertise, and ranking in question/answer social networks
Renmin University of China (Beijing) 2005-2008 (GPA 3.78/4.0)
M.S. in Computer Science and Engineering Supervisor: Prof. Xiaoyong Du
Thesis: A method for similarity calculation on large-scale documents
Xidian University (Xi’an) 2001-2005 (GPA 3.6/4.0)
B.S. in Software Engineering
Skills & Handled Instruments
Solid knowledge of database systems, data mining, and search engines
Hands-on experience with database kernel components for PostgreSQL (2)
Good knowledge of database optimization (table indexing, query analyzing, performance tuning, etc.) (5)
Good knowledge of big data: Hadoop framework (2), PostgreSQL 9.4 (NoSQL features) (1), MongoDB (1), Spark (0.5)
Proficiency with data mining and information retrieval software: Weka (4) and Lucene (1)
Expertise in social network analysis: Q/A community analysis (5)
Expertise in recommendation systems: book and social tagging recommendation systems
Good knowledge of J2EE optimization (JMS optimization, JBoss server performance tuning, Hibernate performance tuning, etc.) (1)
Proficiency with: Java (10), C (5), Matlab (5), SQL (9), PL/SQL (1), J2EE Framework (3), VBA (0.5), C++ (1), EJB (1), JBoss (1), MySQL (2)
Achievements
Database Kernel Development (PostgreSQL 8.3):
Result set cache development
Performance monitoring tools
Online Community Analysis (e.g., Facebook, Yahoo! Answers, Stack Overflow, etc.):
Crawled more than 30 GB of raw web data
Managed more than 10 million question-and-answer pairs
Designed efficient and effective algorithms to analyze answerer behavior
Calculated users' expertise
Recommended questions to the proper users
Digital Library for Renmin University of China (2.0):
Retrieved books by keyword using the Lucene system
Recommended a suitable book personally for each user
This system was used in the Renmin University Library
Healthcare Radiology Information System (RIS):
Full-Stack developer for J2EE framework.
System performance tuning (DB and JBoss server tuning).
Projects
06/2015- Radiology Information System (RIS) Re-Architecture (JBoss, EJB, Java, MySQL, etc.)
Full-stack developer (schedule, worklist, report)
Improved system performance (JMS tuning, DB tuning, JBoss tuning, etc.)
Data migration tool (migration from RIS v2.3 to RIS v2.8)
Import tool (import Excel data into the RIS system)
09/2009-05/2014 Identifying Expertise and Answer Quality in Q/A Social Networks (Java, Hadoop
framework, MongoDB, PostgreSQL and Matlab)
Answer Quality Prediction in Q/A Social Network by Leveraging Temporal Feature
Expertise Ranking of Users in Q/A Community
Identified the specialist for a particular question or domain in the Q/A community
Social tagging recommendation using the tensor decomposition in the Q/A community
02/2008-08/2009 Document Similarity Analysis (Java)
Developed an algorithm to calculate document similarity
Improved the SimRank algorithm's runtime from 2 days to 2 minutes (100 thousand nodes in the citation graph)
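For context, the naive SimRank iteration that such an optimization starts from can be sketched on a toy citation graph. This is illustrative only; the actual speedup came from algorithmic improvements not shown here:

```python
def simrank(in_neighbors, C=0.8, iters=10):
    """Naive O(n^2) SimRank on a small directed graph.

    in_neighbors maps each node to the list of nodes that point to it.
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_neighbors[a] and in_neighbors[b]:
                    total = sum(sim[(x, y)]
                                for x in in_neighbors[a]
                                for y in in_neighbors[b])
                    new[(a, b)] = (C * total /
                                   (len(in_neighbors[a]) * len(in_neighbors[b])))
                else:
                    new[(a, b)] = 0.0   # no in-links: no evidence of similarity
        sim = new
    return sim

# Toy citation graph: p2 and p3 are both cited by p1, so they become similar.
g = {"p1": [], "p2": ["p1"], "p3": ["p1"]}
s = simrank(g)
```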
09/2007-02/2008 Code system development (ontology management system) (Java, PL/SQL, Oracle)
Implemented different kinds of relationships, instances, and classes for the ontology.
This system was used in the Database & Intelligent Information Retrieval Lab.
02/2007-09/2007 Digital Library for Renmin University of China (Java, Lucene)
Developed the book retrieval system 2.0 using Lucene
Recommended a suitable book personally for each user using item-based and user-based algorithms
07/2006-02/2007 Database Performance Monitoring (C and PostgreSQL)
Developed a group of database views to monitor database performance, covering IO, buffer, file, lock, event, and log information, etc.
This monitor was used in Kingbase v.4.1.
03/2006-07/2006 Cadre evaluation system of the CPC Central Committee (VB, PowerDesigner,
Kingbase 4.1 and VBA)
Developed the cadre evaluation system for the CPC Central Committee.
Implement the database design and UI program
This system was used for cadres’ election for 23 provinces in China.
09/2005-03/2006 SQL result set cache (C and PostgreSQL)
Implemented both client memory cache and shared memory cache.
Ran the TPC-C test.
This result set cache was used in Kingbase v4.1.
07/2005-09/2006 Japanese healthcare software system (Java, J2EE framework, Tomcat and
Oracle)
Built the J2EE framework using Struts + Spring + Hibernate
Implemented the database design for the healthcare system
02/2005-06/2006 Business customer behavior analysis system (Java, JSP and Tomcat)
Used data mining techniques to analyze customer behavior
Implemented data mining algorithms such as the ID3 tree classifier, naive Bayes classifier, k-means clustering, the Apriori algorithm, and preprocessing
Led a group of 5 people to implement the system
Work Experience & Internship
06/15 - present Candelis Inc.
Software Engineer, Backend Performance
09/14-06/15 University of Texas at Arlington
Lecturer for CSE 2320 (Data Structures and Algorithms) and CSE 5311 (Algorithms)
09/09-05/14 University of Texas at Arlington
TA for C, Java, Databases, Data Structures, Computer Architecture, and other courses
09/05 - 01/07 Beijing BaseSoft Co., Ltd., PostgreSQL kernel development
Database Kernel Developer
07/05 - 09/05 Shanghai Xinyou Co., Ltd., Japanese healthcare software system
Software Engineer
02/05 - 06/05 Xi’an Software Park, Business customer behavior analysis system
Software Engineer
Honors & Scholarships
TA Fellowship at the University of Texas at Arlington, fall 2009 to May 2014
Three-year Fellowship at Renmin University of China, 2005-2008
Third-class Scholarship at Xidian University, 2004 and 2005
IBM WebSphere Certification
Third-class award, Mathematical Modeling Contest, Xidian University, 2004
Patents
Xiaoyong Du, Hongyan Liu, Jun He, Yuanzhe Cai and Pei Li, A method of document similarity
calculation, Patent Id. CN101576903B
Xiaoyong Du, Hongyan Liu, Jun He, Pei Li and Yuanzhe Cai, Efficient similarity calculation on a
graph using block structure, Patent Id. CN101576905B
Xiaoyong Du, Hongyan Liu, Jun He, Yuanzhe Cai and Xu Jia, Explore the power law distribution on
a graph for efficient similarity calculation, Patent Id. CN101853281A
Publications
[1] Yuanzhe Cai, Sharma Chakravarthy: Answer Quality Prediction in Q/A Social Networks by
Leveraging Temporal Features. International Journal of Next-Generation Computing, Volume 4,
2013
[2] Yuanzhe Cai, Sharma Chakravarthy: Expertise Ranking of Users in QA Community. Database
Systems for Advanced Applications, 18th International Conference, DASFAA 2013, Wuhan, China,
April 22-25, 2013
[3] Yuanzhe Cai, Sharma Chakravarthy: Pairwise Similarity Calculation of Information Networks. Data
Warehousing and Knowledge Discovery - 13th International Conference, DaWaK 2011, Toulouse,
France, August 29 - September 2, 2011.
[4] Yuanzhe Cai, Miao Zhang, Dijun Luo, Chris H. Q. Ding, Sharma Chakravarthy: Low-order tensor
decompositions for social tagging recommendation. Proceedings of the Fourth International
Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12,
2011.
[5] Xu Jia, Hongyan Liu, Li Zou, Jun He, Xiaoyong Du, Yuanzhe Cai: Local Methods for Estimating
SimRank Score. Advances in Web Technologies and Applications, Proceedings of the 12th
Asia-Pacific Web Conference, APWeb 2010, Busan, Korea, 6-8 April 2010.
[6] Yuanzhe Cai, Miao Zhang, Chris H. Q. Ding, Sharma Chakravarthy: Closed form solution of
similarity algorithms. Proceedings of the 33rd International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR 2010, Geneva, Switzerland, July 19-23, 2010.
[7] Xu Jia, Yuanzhe Cai, Hongyan Liu, Jun He, Xiaoyong Du: Calculating Similarity Efficiently in a
Small World. Advanced Data Mining and Applications, 5th International Conference, ADMA 2009,
Beijing, China, August 17-19, 2009.
[8] Yuanzhe Cai, Hongyan Liu, Jun He, Xiaoyong Du, Xu Jia: An Adaptive Method for the Efficient
Similarity Calculation. Database Systems for Advanced Applications, 14th International Conference,
DASFAA 2009, Brisbane, Australia, April 21-23, 2009.
[9] Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient
Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009, The Ninth
IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009.
[10] Pei Li, Yuanzhe Cai, Hongyan Liu, Jun He, Xiaoyong Du: Exploiting the Block Structure of Link
Graph for Efficient Similarity Computation. Advances in Knowledge Discovery and Data Mining,
13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, April 27-30, 2009, Proceedings.
[11] Yuanzhe Cai, Pei Li, Hongyan Liu, Jun He, Xiaoyong Du: S-SimRank: Combining Content and Link
Information to Cluster Papers Effectively and Efficiently. Advanced Data Mining and Applications,
4th International Conference, ADMA 2008, Chengdu, China, October 8-10, 2008.
[12] Yuanzhe Cai and Sharma Chakravarthy. Identifying Specialists for Concepts. 18th International
Conference on Extending Database Technology, March 23-27, 2015 - Brussels, Belgium
(submitted).
[13] Yuanzhe Cai and Sharma Chakravarthy. HITS vs. Non-negative Matrix Factorization. Technical
Report, 2014.
References
Dr. Sharma Chakravarthy, Professor, Department of Computer Science and Engineering, UT
Arlington, email: sharma@cse.uta.edu, Phone: (817) 272-2082
Dr. Chris Ding, Professor, Department of Computer Science and Engineering, UT Arlington, email:
CHQDing@uta.edu, Phone: (817) 272-7041
Dr. Deguang Kong, Senior Research Engineer, Samsung Electronics, email: doogkong@gmail.com,
Phone: (408) 718-4906
Ming Ge, RIS Project Manager, Candelis Inc, email: ming.ge@candelis.com, Phone: (917) 348-8560