Apache Spark Use case for Education Industry

A use case for detecting at-risk students in an academic system so that action can be taken to help them.

  1. © 2016 IBM Corporation. Academic Alert System. Presenter: Vinayak Agrawal (Vagrawal@us.ibm.com)
  2. Agenda
      • Use Case
      • Use Case Architecture/Workflow in Weka
      • Data Volume
      • Problem Statement
      • Our Analytical Platform
      • Spark Workflow
      • Result Comparison between Weka and Spark
      • Spark Challenges
      • Q&A
  3. Use Case: Academic Alert System
      • Academic institutions receive performance-based funding on parameters* such as:
          • Student retention (retention rates)
          • Students graduating (completion rates)
      • Academic institutions want to be proactive in providing academic feedback to students BEFORE they sit the final exam.
      • Goal: develop an ML model that can predict at-risk students (those who might fail) and provide this feedback to students and professors so that they can take appropriate action.
      *Source: http://www.ncsl.org/research/education/performance-funding.aspx
  4. Use Case: Academic Alert System in Weka
  5. Data Volume (in Prod)
      • Learning Management Systems
          1) Student activity data: total = ~350 million records; research = 15-18 million records
          2) Student gradebook data: total = ~1.5 million; research = 100,000 per semester
      • Student Information Systems
          1) Demographics: research = 5500 students per semester x 3
          2) Enrollment: research = 27000 per semester x 3
          3) Course: research = ~2000 per semester x 3
  6. Problem Statement
      • Small universities have fewer students, so Weka might work.
      • Weka: it may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory.
      • To scale out for larger universities: how do I process 45000 students with 20 features?
  7. Analytical Platform
      • Hardware: 3 virtual machines on IBM PureFlex
          • 8 cores per VM
          • 32 GB RAM, 100 GB per VM
      • Software: 3-node Hadoop cluster
          • Spark 1.5.2: Zeppelin, Python, Scala
          • Oozie, Hive and Sqoop
  8. Spark Workflow (diagram labels: Data, Training, Test, Sampling, Train_Data, Imputation, Model, Imputation, Test_Data, Fit, Transform, Predictions)
  9. What Does Our Data Look Like?
      • Data sources: derived from the ETL stage
      • 19 features from the Learning Management System and student demographics
      • Count: training 9923; testing 5145
  10. Sampling (training data)
      • Before duplication:
          Label   Count
          0.0     9267
          1.0     656
      • After duplication:
          Label   Count
          0.0     9267
          1.0     9184
      • 1.0 = student at risk
      • The training data was skewed, with only 656 at-risk students, so we duplicated the at-risk rows.
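The duplication step on the sampling slide can be sketched in plain Python. This is a minimal illustration, not the presenter's code: the function name `oversample_minority`, the `label` key, and the rule for picking the replication factor (the largest whole number that does not overshoot the majority class) are assumptions. With the slide's class counts, that rule happens to give a factor of 14, and 656 x 14 = 9184 matches the count shown after sampling.

```python
from collections import Counter


def oversample_minority(rows, label_key="label", minority=1.0):
    """Duplicate minority-class rows a whole number of times to reduce skew."""
    counts = Counter(row[label_key] for row in rows)
    majority_count = max(counts.values())
    minority_count = counts[minority]
    # Assumed rule: largest whole-number factor that stays at or below the majority count.
    factor = max(1, majority_count // minority_count)
    balanced = []
    for row in rows:
        copies = factor if row[label_key] == minority else 1
        balanced.extend([row] * copies)
    return balanced, factor


# Class counts taken from the slide: 9267 not at risk, 656 at risk.
training = [{"label": 0.0}] * 9267 + [{"label": 1.0}] * 656
balanced, factor = oversample_minority(training)
after = Counter(row["label"] for row in balanced)
print(factor, after[0.0], after[1.0])
```

Only the training data is resampled here; the test set keeps its original class balance, which is consistent with the 360 at-risk rows visible in the result tables.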
  11. Imputation
      • Filling with the mean value for numerical columns
          • Age
          • SAT scores
      • Filling with the mode value for categorical columns
          • Enrollment status
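The imputation rule on this slide (mean for numerical columns, mode for categorical columns) can be sketched as follows. This is a plain-Python illustration under stated assumptions: the column names `age`, `sat`, and `enrollment_status`, the sample rows, and the helper names `fit_imputer` and `apply_imputer` are hypothetical. Learning the fill values on the training rows and reusing them on the test rows is also an assumption about how the workflow diagram's two imputation boxes relate.

```python
from statistics import mean, mode


def fit_imputer(rows, numeric_cols, categorical_cols):
    """Learn one fill value per column from the rows that have a value."""
    fills = {}
    for col in numeric_cols:
        fills[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fills[col] = mode(r[col] for r in rows if r[col] is not None)
    return fills


def apply_imputer(rows, fills):
    """Return copies of the rows with missing (None) values replaced."""
    return [
        {col: (fills[col] if val is None and col in fills else val)
         for col, val in r.items()}
        for r in rows
    ]


# Hypothetical training rows; None marks a missing value.
train = [
    {"age": 20, "sat": 1200, "enrollment_status": "FT"},
    {"age": None, "sat": 1000, "enrollment_status": "FT"},
    {"age": 24, "sat": None, "enrollment_status": None},
    {"age": 22, "sat": 1100, "enrollment_status": "PT"},
]
test = [{"age": None, "sat": None, "enrollment_status": None}]

fills = fit_imputer(train, ["age", "sat"], ["enrollment_status"])
train_filled = apply_imputer(train, fills)
test_filled = apply_imputer(test, fills)
print(fills)
print(test_filled)
```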
  12. Modelling Using the Spark ML Package
      • Why? DataFrame
      • Build the pipeline model
          • StringIndexer for categorical variables
          • VectorAssembler
      • Use the model
      • 4 lines of code (imports added; the *_indexer stages, assembler, trainData and testData are assumed to be defined beforehand and are not shown on this slide):
          from pyspark.ml import Pipeline
          from pyspark.ml.classification import LogisticRegression
          lr = LogisticRegression(maxIter=100, regParam=0.01)
          pipeline_lr = Pipeline(stages=[ONLINE_FLAG_indexer, RC_GENDER_indexer, RC_ENROLLMENT_STATUS_indexer, RC_CLASS_CODE_indexer, STANDING_indexer, assembler, lr])
          model_lr = pipeline_lr.fit(trainData)
          prediction_lr = model_lr.transform(testData)
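What the two feature-preparation stages on this slide do can be pictured in plain Python. This is a conceptual sketch, not Spark code and not the presenter's implementation: Spark's StringIndexer maps each category to a numeric index (by default the most frequent category gets 0.0), and VectorAssembler concatenates the chosen columns into one feature vector. The helper names, the sample rows, and the alphabetical tie-breaking are assumptions made for the illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to a float index, most frequent category first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that is a choice made for this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


def assemble(row, columns):
    """Concatenate the chosen numeric columns into one feature list."""
    return [float(row[c]) for c in columns]


# Hypothetical rows with one categorical and one numerical column.
rows = [
    {"gender": "F", "age": 20},
    {"gender": "M", "age": 22},
    {"gender": "F", "age": 19},
]
gender_index = fit_string_index(r["gender"] for r in rows)
indexed = [dict(r, gender_idx=gender_index[r["gender"]]) for r in rows]
features = [assemble(r, ["gender_idx", "age"]) for r in indexed]
print(gender_index)
print(features)
```

In the pipeline on the slide, each categorical column gets its own indexer stage, and the assembler then feeds a single feature vector to the classifier.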
  13. Logistic Regression Results
      • Spark (test data count: 5145, 19 features)
                      Predicted 0   Predicted 1
          Actual 0    4065          720
          Actual 1    51            309
          309 students at risk; 85.01% accuracy; 85.83% recall; time: 20 seconds
      • Weka (test data count: 5145, 19 features)
                      Predicted 0   Predicted 1
          Actual 0    4093          692
          Actual 1    49            311
          311 students at risk; 85.6% accuracy; 86.4% recall; time: 49 seconds
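The accuracy and recall figures on this slide follow directly from the confusion matrices, reading rows as actual labels and columns as predicted labels. The short sketch below recomputes them; the function name `accuracy_and_recall` and the `tn`/`fp`/`fn`/`tp` parameter names are chosen for this illustration.

```python
def accuracy_and_recall(tn, fp, fn, tp):
    """Compute accuracy and recall (for the at-risk class) from confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total   # share of all students classified correctly
    recall = tp / (tp + fn)        # share of truly at-risk students that were flagged
    return accuracy, recall


# Cell values copied from the logistic regression slide.
spark_acc, spark_rec = accuracy_and_recall(tn=4065, fp=720, fn=51, tp=309)
weka_acc, weka_rec = accuracy_and_recall(tn=4093, fp=692, fn=49, tp=311)

print("Spark: accuracy %.2f%%, recall %.2f%%" % (spark_acc * 100, spark_rec * 100))
print("Weka:  accuracy %.2f%%, recall %.2f%%" % (weka_acc * 100, weka_rec * 100))
```

The same function applies to the random forest and naive Bayes matrices, since every test set contains the same 360 at-risk students out of 5145.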
  14. Random Forest Comparison
      • Spark (data count: 5145, 19 features)
                      Predicted 0   Predicted 1
          Actual 0    4065          720
          Actual 1    51            309
          309 students at risk; 85.01% accuracy; 85.83% recall; time: 16 seconds
      • Weka (data count: 5145, 19 features)
                      Predicted 0   Predicted 1
          Actual 0    4186          599
          Actual 1    83            277
          277 students at risk; 86.7% accuracy; 76.9% recall; time: 30 seconds
  15. Naive Bayes Comparison
      • Spark (data count: 5145, 19 features)
                      Predicted 0   Predicted 1
          Actual 0    4279          506
          Actual 1    158           202
          202 students at risk; 87.1% accuracy; 56.1% recall; time: 9 seconds
      • Weka (data count: 5145, 19 features)
                      Predicted 0   Predicted 1
          Actual 0    4093          692
          Actual 1    67            293
          293 students at risk; 85.2% accuracy; 81.4% recall; time: 30 seconds
  16. Why Is This Better?
      • Workflow (diagram labels: Data, Training, Test, Sampling, Train_Data, Imputation, Model, Imputation, Test_Data, Fit, Transform, Predictions)
      • Complete workflow in one environment: Zeppelin on Spark
      • Java/Scala or Python to choose from
      • Distributed computing
  17. Spark Challenges
      • No Python support to save and load a pipeline model yet
          • SPARK-6725, SPARK-13032
      • ML StringIndexer does not protect itself from column name duplication
          • SPARK-12874
      • PySpark CrossValidatorModel does not support avgMetrics
          • SPARK-12810
          • You have to create an RDD and then extract the metrics
      • PMML export not supported yet
          • SPARK-11171
  18. Q&A
  19. Logistic Regression Model
  20. Random Forest Code
  21. Naïve Bayes Code
  22. Appendix
  23. IBM Open Platform for Apache Hadoop (IOP)
      • Includes Spark
      • 100% open source
      • Implement with help from IBM Lab Services
      • Production support offering available
      • Apache open source components in IBM Open Platform with Apache Hadoop: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
  24. Questions?
