Invited talk at NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar) at IPDPS 2015, Hyderabad, May 25, 2015
Discusses the Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics (https://bigdatacourse.appspot.com/course). The other is BDOSSP: Big Data Open Source Software and Projects (http://bigdataopensourceprojects.soic.indiana.edu/).
Data Science Curriculum at Indiana University - Geoffrey Fox
The document provides details about the Data Science curriculum at Indiana University. It discusses the background of the School of Informatics and Computing, including its establishment and its computer science and library and information science programs. It then describes the Data Science certificate and master's programs, including course requirements, tracks, and admissions. The programs aim to provide students with skills in data analysis, lifecycle, management, and applications through coursework in relevant technical areas.
DTW: 2015 Data Teaching Workshop – 2nd IEEE STC CC and RDA Workshop on Curricula and Teaching Methods in Cloud Computing, Big Data, and Data Science
as part of CloudCom 2015 (http://2015.cloudcom.org/), Vancouver, Nov 30-Dec 3, 2015.
Discusses the Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics; the other is BDOSSP: Big Data Open Source Software and Projects. Links are
http://openedx.scholargrid.org/ BDAA Fall 2015
http://datascience.scholargrid.org/ BDOSSP Spring 2016
http://bigdataopensourceprojects.soic.indiana.edu/ Spring 2015
Big Data HPC Convergence and a bunch of other things - Geoffrey Fox
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomenon, jobs, and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C... - Geoffrey Fox
Describes relations between Big Data and Big Simulation applications and how these can guide a Big Data-Exascale (Big Simulation) convergence (as in the National Strategic Computing Initiative) and lead to a "complete" set of benchmarks. The basic idea is to view use cases as "Data" + "Model".
Building the Data Science Profession in Europe - Steven Miller
The document discusses the EDISON Data Science Framework (EDSF), which aims to establish data science as a formal profession in Europe. The EDSF includes a Data Science Competence Framework (CF-DS) that defines essential competences and skills. It also includes a Data Science Body of Knowledge (DS-BoK) and a Data Science Model Curriculum (MC-DS) to guide educational and training programs. The framework analyzed job postings and existing standards to identify five core competence groups for data science: data analytics, data engineering, domain expertise, data management, and scientific/research methods.
1) Rensselaer Polytechnic Institute aims to incorporate data science education across its curriculum to develop "data dexterity" in every student.
2) A proposed core curriculum includes data-intensive courses in science and the major, as well as collaborative projects through a new Data INCITE laboratory.
3) The goal is for data management and analysis to become as fundamental as calculus, with open data sharing and verification of results.
The document discusses the steps involved in the data science life cycle (DSLC). It describes the main steps as business understanding, data acquisition and understanding, modeling, deployment, and customer acceptance. It provides details on several of these steps, including business understanding, data acquisition and understanding, data modeling, and initial data exploration. The goal is to clearly outline the typical process and considerations for a data science project from defining the problem to exploring the available data.
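The lifecycle stages named above can be sketched as an explicit pipeline. The stage names follow the summary; the function bodies are invented placeholders for illustration, not an implementation from the document:

```python
# Illustrative skeleton only: the data science life cycle stages named in the
# summary, expressed as a sequence of explicit steps. The bodies are
# invented placeholders, not real project logic.
def business_understanding(problem):
    return {"problem": problem, "metric": "accuracy"}  # define goals and success metrics

def data_acquisition(state):
    return {**state, "data": [1, 2, 3]}  # ingest and explore the available data

def modeling(state):
    return {**state, "model": "baseline"}  # engineer features, train, evaluate

def deployment(state):
    return {**state, "deployed": True}  # expose the model to consumers

def customer_acceptance(state):
    return {**state, "accepted": True}  # confirm the system meets the original goals

state = business_understanding("predict churn")
for stage in (data_acquisition, modeling, deployment, customer_acceptance):
    state = stage(state)
print(state["deployed"], state["accepted"])
```

The point of the sketch is only that each stage consumes and enriches a shared project state, so the process reads as one explicit sequence from problem definition to acceptance.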
The document provides an overview of an event on emerging trends in data science given by Dr. Joanne Luciano. It discusses the data science workflow and various processes involved. Some key trends highlighted include increased use of AI and machine learning in data management and reporting, growth of natural language processing, advances in deep learning, emphasis on data privacy and ethics. The document also promotes the new minor in data science offered at University of the Virgin Islands, covering required courses and examples of course sequences for different disciplines.
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im... - Ilkay Altintas, Ph.D.
The new era of data science is here. Our lives and society are continuously transformed by our ability to collect data in a systematic fashion and turn it into value. The opportunities created by this change also come with challenges that push for new and innovative data management and analytical methods, as well as for translating these new methods to applications in many areas that impact science, society, and education. Collaboration and the ability of multi-disciplinary teams to work together and communicate, bringing together the best of their knowledge in business, data, and computing, is vital for impactful solutions. This talk discusses a reference ecosystem and question-driven methodology, called PPODS, for making impactful data science applications in many fields, with specific examples in hazards, smart cities, and biomedical research.
Deploying Open Learning Analytics at a National Scale - michaeldwebb
This document discusses deploying open learning analytics at a national scale in the UK. It provides an overview of learning analytics, including definitions, stages of analytics, and a strategic view of national issues related to retention, achievement, and teaching excellence. Technically, it describes Jisc's open architecture and predictive modeling framework. Implementation trends seen in the field emphasize change management, piloting solutions, focusing on data quality, and keeping initial implementations simple.
Ontologies for Emergency & Disaster Management - Stephane Fellah
OGC meeting, March 2014
OGC OWS-10 Cross-Community Interoperability
Ontologies for Emergency & Disaster Management
(The application of geospatial linked data)
Ontologies for Crisis Management: A Review of State of the Art in Ontology De... - streamspotter
This document summarizes an existing review of ontologies for crisis management. The review identified 11 key subject areas related to crisis management concepts (e.g. disasters, infrastructure, organizations). It found 27 existing ontologies that cover these subject areas to varying degrees. Most ontologies were designed for general purposes, disaster archives, or response and relief. The review concluded that while existing ontologies cover important subject areas, some areas are not fully addressed and links between areas need improvement. It provided a basis for developing a new framework and ontology (Cihai) to improve semantic interoperability in crisis management.
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data... - Geoffrey Fox
Advances in high-performance/parallel computing in the 1980s and '90s were spurred by the development of quality high-performance libraries, e.g., ScaLAPACK, as well as by well-established benchmarks such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that the corresponding benchmarks should be drawn from frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" big data stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate.
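The Linpack benchmark referenced above times the solution of a dense linear system Ax = b. As a toy illustration only (not from the talk, and far simpler than the tuned implementations real Linpack runs use), the measured operation can be sketched in plain Python:

```python
# Toy, illustrative sketch only: the real Linpack benchmark is highly tuned
# Fortran/C code; this just shows the core operation it times, namely
# solving a dense linear system Ax = b by Gaussian elimination.
import random
import time

def solve_dense(a, b):
    """Solve Ax = b via Gaussian elimination with partial pivoting (mutates a, b)."""
    n = len(a)
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (b[row] - s) / a[row][row]
    return x

random.seed(0)
n = 100
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [random.random() for _ in range(n)]
start = time.perf_counter()
x = solve_dense([row[:] for row in a], b[:])  # solve on copies, keep a and b intact
elapsed = time.perf_counter() - start
# Linpack reports floating-point operations per second; dense LU factorization
# needs roughly (2/3) * n^3 flops.
flops = (2.0 / 3.0) * n ** 3
print(f"n={n}: {elapsed:.4f}s, ~{flops / elapsed / 1e6:.1f} MFLOP/s")
```

The open question the talk raises is what plays the role of this single, well-defined kernel when the workload is a data analytics pattern (an Ogre) rather than a dense solve.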
This document provides an overview of data analytics processes for learning and academic analytics projects. It discusses key data dimensions including computing, location, time, activity, physical conditions, resources, user attributes, and relations. It then covers applications of analytics for learners and teachers to monitor learning and improve performance. The document outlines the stages of an extract-transform-load data processing workflow. Finally, it discusses different methods for knowledge discovery including prediction, structure discovery through clustering and factor analysis, and relationship mining through association rules, correlations, sequential patterns and causal analysis.
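The extract-transform-load stages and the association-rule mining named in the summary above can be sketched minimally as follows; the session log, activity names, and support/confidence thresholds are all invented for illustration:

```python
# Illustrative sketch only (invented event log and thresholds): a minimal
# extract-transform-load pass over learner activity records, followed by
# simple association-rule mining (support and confidence of "A implies B").
from collections import Counter
from itertools import combinations

# Extract: raw activity records, one whitespace-separated session per line.
raw_sessions = [
    "video quiz forum", "video quiz", "forum quiz wiki",
    "video wiki", "video quiz forum wiki",
]

# Transform: normalize each session into a set of distinct activities.
sessions = [frozenset(line.split()) for line in raw_sessions]

# Load: count single activities and activity pairs across all sessions.
item_counts = Counter(item for s in sessions for item in s)
pair_counts = Counter(pair for s in sessions for pair in combinations(sorted(s), 2))

# Relationship mining: keep rules A -> B with enough support and confidence.
n = len(sessions)
min_support, min_confidence = 0.4, 0.6
rules = []
for (a, b), count in pair_counts.items():
    support = count / n  # fraction of sessions containing both activities
    if support < min_support:
        continue
    for lhs, rhs in ((a, b), (b, a)):
        confidence = count / item_counts[lhs]  # P(rhs | lhs)
        if confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))

for lhs, rhs, sup, conf in sorted(rules):
    print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

On this toy log the rule "forum -> quiz" holds with confidence 1.0: every session that touched the forum also took a quiz, which is the kind of relationship the mining stage surfaces.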
The document introduces an advanced services engineering course. The course covers emerging techniques for engineering large-scale, elastic service systems for complex data analytics across clouds, IoT systems and human computation platforms. Topics include big/real-time data provisioning and analytics, quality-aware workflow design, and hybrid software/human service integration in multiple cloud environments. The course involves lectures, assignments, a mini-project and final exam. It is aimed at advanced master's and PhD students with distributed systems knowledge.
Cyberinfrastructure in Louisiana: From Black Holes to Hurricanes. Presentation at Cyberinfrastructure Days, Notre Dame, April 29-30, 2010. http://ci.nd.edu/
A talk I gave at the MMDS workshop in June 2014 on the Myria system, as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
Comparing Big Data and Simulation Applications and Implications for Software ... - Geoffrey Fox
At eScience in the Cloud 2014, Redmond, WA, April 30, 2014
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote much more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC clusters are presented.
Roadmaps, Roles and Re-engineering: Developing Data Informatics Capability in... - LIBER Europe
A presentation by Dr. Liz Lyon of the United Kingdom Office for Library and Information Networking, as given at LIBER's 42nd annual conference in Munich, Germany.
Enabling Data-Intensive Science Through Data Infrastructures - LIBER Europe
These slides are from a talk given at LIBER's 42nd annual conference by Carlos Morais Pires of the European Commission.
In light of the current data deluge, and of the European Commission's plans to harness it through e-infrastructures for data-driven science under Horizon 2020, Pires issued a call to action for libraries to engage in the data infrastructure and bring their own unique, and now much-needed, competencies to bear in bringing meaning to, and spreading the word about, data-driven science.
DC 2012 - Leveraging the DDI Model for Linked Statistical Data in the Social... - Dr.-Ing. Thomas Hartmann
This document discusses leveraging the DDI (Data Documentation Initiative) model for linking statistical data in the social, behavioral, and economic sciences. It presents the DDI standard for documenting and managing research data, and how a DDI ontology was developed to publish DDI metadata as linked data. Key use cases for discovery of related studies, variables, and data files are described. The presentation concludes with acknowledgements.
This document summarizes a seminar on machine learning using big data. It discusses the history of data storage and traditional databases. It then introduces machine learning and the types of learning, including supervised and unsupervised learning. Specific algorithms for each type are covered such as k-means clustering for unsupervised and naive Bayes for supervised. Case studies on applications like Amazon product recommendations are presented. The document concludes by discussing tools for machine learning and future applications as more connected devices generate extensive data.
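As a minimal illustration of the unsupervised side the seminar covers (the toy points and cluster count are invented, and a real project would use a library such as scikit-learn rather than hand-rolled code), k-means clustering in plain Python:

```python
# Illustrative toy only: k-means (Lloyd's algorithm) on invented 2-D points.
# A real seminar project would use a library such as scikit-learn.
import math

def kmeans(points, k, iters=20):
    """Alternate assigning each point to its nearest centroid and
    recomputing each centroid as the mean of its cluster."""
    centroids = points[:k]  # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep old centroid if cluster emptied
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups: one near (0, 0), one near (10, 10).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

With clearly separated groups like these, the algorithm converges in a couple of iterations to one centroid per group; the supervised counterpart mentioned in the summary (naive Bayes) would instead learn from labeled examples.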
As the volume and complexity of data from myriad Earth Observing platforms, both remote sensing and in-situ, increase, so does the demand for access to both the data and the information products derived from them. The audience is no longer restricted to investigator teams with specialist science credentials. Non-specialist users, ranging from scientists in other disciplines to the science-literate public, teachers, the general public, and decision makers, want access. What prevents them from accessing these resources? It is the very complexity of specialist-developed data formats, data set organizations, and specialist terminology. What can be done in response? We must shift the burden from the user to the data provider. To achieve this, our data infrastructures will likely need greater internal code and data-structure complexity in order to deliver (relatively) simpler end-user experiences. Evidence from numerous technical and consumer markets supports this scenario. We will cover the elements of modern data environments, the new use cases, and how we can respond to them.
Creating Interactive Dashboards with Microsoft Excel - AACRAO
Sign up to view the archived webinar here: http://www.aacrao.org/conferences/conferences-detail-view/creating-interactive-dashboards-in-excel
Other colleges' dashboards making you see green even though it is not your school color? No budget for specialized dashboard programs? Can't keep up with end-user demands for different analyses?
New features in Excel 2010 and 2013 allow even casual users to create interactive dashboards that are both functional and great-looking, letting you and your end-users explore your data in ways you have only imagined and convert it into actionable information.
Even if you are a Pivot Table novice, you can create functional and great looking dashboards. In this webinar, we will show you the basic steps for creating interactive dashboards in Excel 2010 and 2013. Taking a holistic SEM approach, we will examine several use-cases throughout the student lifecycle.
From setting up your data to creating the dashboard and modifying it to your own school colors, we will cover the basics of building a simple yet interactive and informative dashboard. Some basic knowledge of Pivot Tables is useful but not required.
Launch of the Week: Eastern Washington University - Laura Faccone
Each week, Schneider Associates analyzes the most significant brand, product, campaign or idea launch of the week. Learn more about launch at www.schneiderpr.com or email launch@schneiderpr.com.
The document provides information about graduate programs in engineering and computer science at the University of Southern California (USC) Viterbi School of Engineering. It discusses the school's master's and doctoral programs, admissions requirements, tuition costs, scholarships, internship and career opportunities, and contact information for the USC Viterbi India Office.
This document summarizes a presentation about priming the computer science pump by providing resources to establish robust CS programs. It discusses workforce trends showing a shortage of programmers and a lack of CS exposure and enrollment. Charts show that many open tech jobs in Austin are in CS, while CS majors and AP enrollment are declining. Barriers to CS education include a lack of teachers and certifications. Recommendations include moving CS courses to CTE, expanding teacher training, offering engaging CS courses, and connecting schools to careers. The document promotes free online CS resources and curricula from Code.org, UT Austin, TRC, and TEALS for teachers.
2019 01 16 data matters - v6 - Using data to support the student digital expe... - jisc_digital_insights
Presentation to Data Matters conference on the 16th Jan 2019, entitled 'Using data to support the student digital experience'. Also included presentations by Marc Griffiths of LSBU and Marieke Guy at RAU
This document discusses top universities that offer MS degrees in Data Science. It begins by outlining typical MS in Data Science program curriculums, which generally integrate computer science, statistics, math, and programming. It then lists 14 top universities in the US that offer specialized MS in Data Science degrees, providing details on each program's curriculum, length, and location. The document emphasizes the importance of location for career opportunities and encourages readers to evaluate their profiles against these university programs.
LiFT Scholars Program: Supporting the Next Generation of Technology LeadersDr. Molly Morin
This presentation is from the 2018 NASPA Assessment and Persistence Annual Conference and provides an overview of the LiFT Scholars Program, a scholarship and support program for students who demonstrate financial need and are pursuing a career in the information technology (IT) fields from IUPUI and Ivy Tech Community College. This program now serves 140 students and not only helps build the students' professional networks in the IT field, but is also support the transfer pathways for Ivy Tech community college students, whom come from a variety of diverse backgrounds including: first-generation college students, military veterans, returning college students, refugee students, student parents, and more.
This presentation discusses the past, present, and future of educational technology in the Katy Independent School District. It reviews how technology has progressed over the last 20 years and outlines the current technology infrastructure and resources, including a $60 million annual budget. The presentation then explores emerging technologies that may be implemented in the next 1-5 years, such as cloud computing, mobile learning, learning analytics, open educational resources, 3D printing, and virtual laboratories. The goal is to continue enhancing technology to improve teaching, learning, communication, and access to educational opportunities.
The document summarizes information about the Project Lead The Way (PLTW) program, including its focus on engaging schools to implement PLTW programs, the regional directors overseeing implementation, and PLTW's recent expansion of its engagement network. It also discusses the need and demand for STEM education to keep the US competitive, outlines the PLTW curriculum programs, and summarizes benefits of the PLTW certification process for schools.
Overview of Effective Learning Analytics Using data and analytics to support ...Bart Rienties
Begona Nunez-Herran and Kevin Mayles (Data and Student Analytics), Rebecca Ward (Data Strategy and Governance)
-Move towards centralised LA data infrastructure
-Data governance and lessons learned
Prof Bart Rienties & PhD students (Institute of Educational Technology)
-What is the latest “blue sky” learning analytics research from the OU?
-Rogers Kalissa: Social Learning Analytics to support teaching (University of Oslo)
-Saman Rizvi: Cultural impact of MOOC learning (IET)
-Shi Min Chua: Why does no one reply to my posts (IET/WELS)
-Maina Korir: Ethics and LA (IET)
-Anna Gillespie: Predictive Learning Analytics and role of tutors (EdD)
Prof John Domingue (Knowledge Media Institute) & Dr Thea Herodotou (IET)
-What have we learned from 5 years of large scale implementation of OU Analyse?
-Where is LA/AI going?
John Watson of Evergreen Education Group shares his findings from the 2014 edition of Keeping Pace with K-12 Digital Learning: An Annual Review of Policy and Practice.
This deck covers 11 stories of success in addressing off-campus Education Broadband access by School Districts across the US. Districts covered are large-small, urban, suburban and rural, Title I, Title III, Pre-K programs through AP Learning in high school. Back of the deck has info about Kajeet used as a reference during Q&A.
The document discusses the role and potential of instructional technologies and ICT in education. It outlines several dilemmas and realities in effectively integrating ICT. The key potentials of ICT include expanding access to education, increasing efficiency, enhancing the quality of learning and teaching, facilitating skill formation, and improving planning and management. Realizing this potential requires addressing prerequisites like infrastructure, content, personnel training, and financial resources. The document concludes that ICT can make education more effective and responsive when properly integrated, though we must not lose sight of learning itself in marveling over the technologies.
Learning analytics and the learning and teaching journey | Prof Deborah West ...Blackboard APAC
Much work has been done across the sector in relation to learning analytics including the implementation of Analytics for Learn as well as Pyramid and SQL reporting. This work has provided us with data around learning and teaching interactions at various levels and in different contexts. From this data reports are generated that can be used in a variety of ways including to address issues of retention, assist with student success, support teaching practice and facilitate curriculum improvement . However, many academics are not quite sure of what is available, what it can be used for or the timing around usage. This can present a range of challenges including the under-utilisation of reports that are available, inappropriate use of reports or a sense that reports are not very useful. One way that we are tackling these challenges at Charles Darwin University it to conceptualise the reports within the framework of the learning and teaching journey. This includes a variety of perspectives from the student journey to the curriculum lifecycle. This also provides the opportunity to consider the relevance of reports to different learning and teaching contexts and approaches. This session will present our framework highlighting recommended time frames and applications for various reports as well as drawing attention to both the benefits and limitations of the approach.
The document introduces the University of Wisconsin-Madison's Engineering Professional Development program and its online Master's in Applied Computing and Engineering Data Analytics. It outlines EPD's expertise in online engineering education, the benefits of UW's approach including practical learning and interaction, and details of the Applied Computing master's program including its focus on data analytics skills, typical courses, and application process.
This survey was conducted with students from three universities to determine their knowledge of and interest in new technologies. The results showed that students' overall knowledge is low, with only 16% having learned about blockchain technology from their university. While interest was higher than knowledge, males had greater knowledge and interest than females. Master's students also had higher knowledge than Bachelor's students. Most students felt universities could better prepare them for the job market and apply new technologies more in their own teaching. They requested additional courses in emerging technologies to expand their learning. The survey concludes universities need improvements in how they impart knowledge and serve as models for applying new methods and technologies.
Aligning IT and University Strategy - Paul Curran - Jisc Digital Festival 2014Jisc
City University London has the ambition to be a leading global university and is investing heavily in academic staff, IT and its estate. This presentation will start with a discussion of some of the major sectoral trends in IT supply and demand with a focus on education.
The IT service at City in 2010/11 and today will be described, along with discussion of the journey and some of the challenges faced. Particular attention will be paid to a move from a devolved 'cottage industry' approach to a more centralised and commoditised but flexible approach to IT service; changing student expectations and aligning with the University’s Strategic Plan.
The presentation will conclude with some observations on this transition for both academic staff and IT professional staff.
The Long Road from Reactive to Proactive: Developing an Accessibility Strategy3Play Media
Implementing accessibility policies in higher education is no easy task. For many, it is easy to get caught in a cycle of reactive accommodation where larger accessibility policies are never implemented. So how do you transition from reactive policies to proactive policies?
Korey Singleton, the Assistive Technology Initiative Manager at George Mason University, will walk you through their two-year process of moving from reactive solutions to proactive accessibility policies. His own experience with how difficult it can be to shift campus climate and administrative support towards proactive accessibility is incredibly useful for other universities struggling with the same thing. His detailed presentation will provide insight into how George Mason has overcome these challenges and developed a proactive approach to accessibility.
This webinar will cover:
- Collaborative strategies for campus-wide IT accessibility
- Strategies for getting faculty to use and create accessible material
- George Mason's accessibility policies & recent updates
- Workflow, collaboration, and policy recommendations
- Resources for accessibility training and testing
- Analysis of completed accessible media requests by fiscal year
Pam Muth and Lisa Bolton: Optimising QILT to improve the student experienceStudiosity.com
The document discusses optimizing the Quality Indicators for Learning and Teaching (QILT) program to improve the student experience. QILT administers the Student Experience Survey (SES) and Graduate Outcomes Survey (GOS) to measure student engagement, teaching quality, and graduate employment outcomes. The SES collects data from current students on their educational experience. Results are available on the QILT website to allow students to compare institutions. Institutions receive detailed SES results and can integrate QILT data into strategic planning to monitor performance indicators over time. Customizing QILT surveys allows institutions to address questions not covered and better evaluate specific strategic initiatives.
Similar to Lessons from Data Science Program at Indiana University: Curriculum, Students and Online Delivery (20)
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...Geoffrey Fox
Most things are dominated by Artificial Intelligence (AI). Technology Companies like Amazon, Google, Facebook, and Microsoft are AI First organizations.
Engineering achievement today is highlighted by the AI buried in a vehicle or machine. Industry (Manufacturing) 4.0 focusses on the AI-Driven future of the Industrial Internet of Things.
Software is eating the world.
We can describe much computer systems work as designing, building and using the Global AI and Modelling supercomputer which itself is autonomously tuned by AI. We suggest that this is not just a bunch of buzzwords but has profound significance and examine consequences of this for education and research.
Naively high-performance computing should be relevant for the AI supercomputer but somehow the corporate juggernaut is not making so much use of it. We discuss how to change this.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink and Heron as well as MPI and Asynchronous Many Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function as a Service Architecture. Note this "new grid" is focussed on data and IoT; not computing. Use interoperable common abstractions but multiple polymorphic implementations.
High Performance Computing and Big Data Geoffrey Fox
This document proposes a hybrid software stack that combines large-scale data systems from both research and commercial applications. It runs the commodity Apache Big Data Stack (ABDS) using enhancements from High Performance Computing (HPC) to improve performance. Examples are given from bioinformatics and financial informatics. Parallel and distributed runtimes like MPI, Storm, Heron, Spark and Flink are discussed, distinguishing between parallel (tightly-coupled) and distributed (loosely-coupled) systems. The document also discusses optimizing Java performance and differences between capacity and capability computing. Finally, it explains how this HPC-ABDS concept allows convergence of big data, big simulation, cloud and HPC systems.
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Geoffrey Fox
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper identifies a class of machine learning applications with significant computation and communication as a yardstick and presents five optimizations to yield high performance in Java big data analytics. Also, it incorporates these optimizations in developing SPIDAL Java - a highly optimized suite of Global Machine Learning (GML) applications. The optimizations include intra-node messaging through memory maps over network calls, improving cache utilization, reliance on processes over threads, zero garbage collection, and employing offheap buffers to load and communicate data. SPIDAL Java demonstrates significant performance gains and scalability with these techniques when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
http://dsc.soic.indiana.edu/publications/hpc2016-spidal-high-performance-submit-18-public.pdf
http://dsc.soic.indiana.edu/presentations/SPIDALJava.pptx
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. These leads to 64 properties divided into 4 views, which are Problem Architecture (Macro pattern); Execution Features (Micro patterns); Data Source and Style; and finally the Processing (runtime) View.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
High Performance Processing of Streaming DataGeoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack -- SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) studied and improved as example of HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
Visualizing and Clustering Life Science Applications in Parallel Geoffrey Fox
HiCOMB 2015 14th IEEE International Workshop on
High Performance Computational Biology at IPDPS 2015
Hyderabad, India. This talk covers parallel data analytics for bioinformatics. Messages are
Always run MDS. Gives insight into data and performance of machine learning
Leads to a data browser as GIS gives for spatial data
3D better than 2D
~20D better than MSA?
Clustering Observations
Do you care about quality or are you just cutting up space into parts
Deterministic Clustering always makes more robust
Continuous clustering enables hierarchy
Trimmed Clustering cuts off tails
Distinct O(N) and O(N2) algorithms
Use Conjugate Gradient
Experience with Online Teaching with Open Source MOOC TechnologyGeoffrey Fox
This memo describes experiences with online teaching in Spring Semester 2014. We discuss the technologies used and the approach to teaching/learning.
This work is based on Google Course Builder for a Big Data overview course
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Big Data and Clouds: Research and EducationGeoffrey Fox
Presentation September 9 2013 PPAM 2013 Warsaw
Economic Imperative: There are a lot of data and a lot of jobs
Computing Model: Industry adopted clouds which are attractive for data analytics. HPC also useful in some cases
Progress in scalable robust Algorithms: new data need different algorithms than before
Progress in Data Intensive Programming Models
Progress in Data Science Education: opportunities at universities
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However the same is not so true for data intensive even though commercially clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is value of a high performance Java (Grande) runtime that supports simulations and big data
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
Keynote at Sixth International Workshop on Cloud Data Management CloudDB 2014 Chicago March 31 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and very importantly data source.
We then propose that in many cases it is wise to combine the well known commodity best practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC Apache integration is particularly important: File systems, Cluster resource management, File and object data management, Inter process and thread communication, Analytics libraries, Workflow and Monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
This proposes an integration of HPC and Apache Technologies. HPC-ABDS+ Integration areas include
File systems,
Cluster resource management,
File and object data management,
Inter process and thread communication,
Analytics libraries,
Workflow
Monitoring
Classification of Big Data Use Cases by different FacetsGeoffrey Fox
Ogres classify Big Data applications by multiple facets – each with several exemplars and features. This gives a
guide to breadth and depth of Big Data and allows one to examine which ogres a particular architecture/software support.
FutureGrid Computing Testbed as a ServiceGeoffrey Fox
Describes FutureGrid and its role as a Computing Testbed as a Service. FutureGrid is user-customizable, accessed interactively and supports Grid, Cloud and HPC software with and without VM’s. Lessons learnt and example use cases are described
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
Motivating Introduction to MOOC on Big Data from an applications point of view https://bigdatacoursespring2014.appspot.com/course
Course says:
Geoffrey motivates the study of X-informatics by describing data science and clouds. He starts with striking examples of the data deluge with examples from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trend in both clouds and big data.
He introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described with growing importance of data oriented version. He covers 3 major X-informatics areas: Physics, e-Commerce and Web Search followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on a data science education and the benefits of using MOOC's.
How Barcodes Can Be Leveraged Within Odoo 17Celine George
In this presentation, we will explore how barcodes can be leveraged within Odoo 17 to streamline our manufacturing processes. We will cover the configuration steps, how to utilize barcodes in different manufacturing scenarios, and the overall benefits of implementing this technology.
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxEduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
🔥🔥🔥🔥🔥🔥🔥🔥🔥
إضغ بين إيديكم من أقوى الملازم التي صممتها
ملزمة تشريح الجهاز الهيكلي (نظري 3)
💀💀💀💀💀💀💀💀💀💀
تتميز هذهِ الملزمة بعِدة مُميزات :
1- مُترجمة ترجمة تُناسب جميع المستويات
2- تحتوي على 78 رسم توضيحي لكل كلمة موجودة بالملزمة (لكل كلمة !!!!)
#فهم_ماكو_درخ
3- دقة الكتابة والصور عالية جداً جداً جداً
4- هُنالك بعض المعلومات تم توضيحها بشكل تفصيلي جداً (تُعتبر لدى الطالب أو الطالبة بإنها معلومات مُبهمة ومع ذلك تم توضيح هذهِ المعلومات المُبهمة بشكل تفصيلي جداً
5- الملزمة تشرح نفسها ب نفسها بس تكلك تعال اقراني
6- تحتوي الملزمة في اول سلايد على خارطة تتضمن جميع تفرُعات معلومات الجهاز الهيكلي المذكورة في هذهِ الملزمة
واخيراً هذهِ الملزمة حلالٌ عليكم وإتمنى منكم إن تدعولي بالخير والصحة والعافية فقط
كل التوفيق زملائي وزميلاتي ، زميلكم محمد الذهبي 💊💊
🔥🔥🔥🔥🔥🔥🔥🔥🔥
Gender and Mental Health - Counselling and Family Therapy Applications and In...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...EduSkills OECD
Andreas Schleicher, Director of Education and Skills at the OECD presents at the launch of PISA 2022 Volume III - Creative Minds, Creative Schools on 18 June 2024.
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.pptHenry Hollis
The History of NZ 1870-1900.
Making of a Nation.
From the NZ Wars to Liberals,
Richard Seddon, George Grey,
Social Laboratory, New Zealand,
Confiscations, Kotahitanga, Kingitanga, Parliament, Suffrage, Repudiation, Economic Change, Agriculture, Gold Mining, Timber, Flax, Sheep, Dairying,
Lessons from Data Science Program at Indiana University: Curriculum, Students and Online Delivery
1. Lessons from Data Science Program at Indiana University: Curriculum, Students and Online Delivery
NSF/TCPP Workshop on Parallel and Distributed Computing Education
Edupar at IPDPS 2015
May 25, 2015
5/25/2015 1
Geoffrey Fox
gcf@indiana.edu http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
3. Background of the School
• The School of Informatics was established in 2000 as the first of its kind in the United States.
• Computer Science was established in 1971 and became part of the school in 2005.
• Library and Information Science was established in 1951 and became part of the school in 2013.
• Now named the School of Informatics and Computing.
• Data Science added January 2014
• Engineering to be added Fall 2016
4. What Is Our School About?
The broad range of computing and information technology: the science, a broad range of applications, and their human and societal implications.
United by a focus on information and technology, our extensive programs include:
• Computer Science
• Informatics
• Information Science
• Library Science
• Data Science (virtual - starting)
• Engineering (real - expected)
5. Size of School
(2014-2015)
• Faculty ~100
• Students
Undergraduate 1,472
Master’s 649
Ph.D. 243
• Female Undergraduates 21% (68% since 2007)
• Female Graduate Students 28% (4% since 2007)
Undergraduates mainly Informatics (75%); Graduates mainly Computer Science
7. SOIC Data Science Program
• Cross Disciplinary Faculty – 31 in the School of Informatics and Computing, a few in Statistics, and expanding across campus
• Affordable online and traditional residential curricula, or a mix thereof
• Masters, Certificate, and PhD Minor in place; full PhD being studied
• http://www.soic.indiana.edu/graduate/degrees/data-science/index.html
• Note data science is mentioned in faculty advertisements but, unlike other parts of the School, there are no dedicated faculty
Data Science is around 7% of the School, measured by the fraction of enrolled students summed over graduate and undergraduate levels.
8. IU Data Science Program and Degrees
• Program managed by cross disciplinary Faculty in Data Science. Currently the Department of Statistics and the School of Informatics and Computing, with plans to expand scope to the full campus
• A purely online 4-course Certificate in Data Science has been running since January 2014
  – Some switched to the Online Masters
  – Most students are professionals taking courses in their “free time”
• A campus-wide Ph.D. Minor in Data Science has been approved.
• Masters in Data Science (10 courses) approved October 2014
• Exploring a PhD in Data Science
• Courses are labelled as “Decision-maker” and “Technical” paths; McKinsey projects an order of magnitude more unmet job openings (1.5 million by 2018) in the Decision-maker track
9. McKinsey Institute on Big Data Jobs
• There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
• The IU Data Science Decision Maker Path is aimed at the 1.5 million jobs; the Technical Path covers the 140,000 to 190,000.
http://www.mckinsey.com/mgi/publications/big_data/index.asp
10. Job Trends
Big Data is about an order of magnitude larger than data science.
19 May 2015 jobs on indeed.com:
• 3,475 for “data science”
• 2,277 for “data scientist”
• 19,488 for “big data”
http://www.indeed.com/jobtrends?q=%22Data+science%22%2C+%22data+scientist%22%2C+%22big+data%22%2C&l=
11. What is Data Science?
• The next slide gives a definition arrived at by a NIST study group in fall 2013.
• The previous slide says there are several jobs, but that’s not enough! Is this a field – what is it and what is its core?
  – The emergence of the 4th, data-driven paradigm of science illustrates its significance – http://research.microsoft.com/en-us/collaboration/fourthparadigm/
  – Discovery is guided by data rather than by a model
  – “The End of (traditional) Science” http://www.wired.com/wired/issue/16-07 is famous here
• Another example is recommender systems in Netflix, e-commerce, etc.
  – Here data (user ratings of movies or products) allows an empirical prediction of what users like
  – Here we define points in spaces (of users or products), cluster them, etc. – all conclusions coming from data
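The recommender idea in the last bullets can be made concrete: treat each user as a point in the space of product ratings, measure closeness between users, and predict an unrated item from similar users. Below is a minimal sketch with invented toy data – the ratings, user names, and similarity-weighted averaging are illustrative assumptions, not material from the course:

```python
import math

# Toy ratings matrix: users x movies; 0 means "not yet rated" (invented data).
ratings = {
    "alice": [5, 4, 0, 1],
    "bob":   [4, 5, 1, 1],
    "carol": [1, 1, 5, 4],
}

def cosine(u, v):
    """Cosine similarity between two rating vectors (points in rating space)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(user, item):
    """Predict a rating as the similarity-weighted average of other users' ratings."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or r[item] == 0:
            continue
        s = cosine(ratings[user], ratings[other])
        num += s * r[item]
        den += abs(s)
    return num / den if den else 0.0

# Alice has not rated movie 2; Bob (similar tastes) rated it 1 and
# Carol (dissimilar tastes) rated it 5, so the prediction stays low.
print(round(predict("alice", 2), 2))
```

Production systems at Netflix scale factorize large sparse matrices rather than comparing raw vectors, but the geometry is the same: points, distances, and clusters, with all conclusions coming from the data.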
12. Data Science Definition from NIST Public Working Group
• Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and analytical hypothesis analysis.
• A Data Scientist is a practitioner who has sufficient knowledge of the overlapping regimes of expertise in business needs, domain knowledge, analytical skills and programming expertise to manage the end-to-end scientific method process through each stage in the big data lifecycle.
See Big Data Definitions in http://bigdatawg.nist.gov/V1_output_docs.php
13. Some Existing Online Data Science Activities
• Indiana University’s Masters is “blended”: online and/or residential; other universities offer residential programs
• We discount online classes so that the total cost of 10 courses is ~$11,500 (in-state price)
14. Computational Science
• Computational science has important similarities to data science, but with a simulation rather than data analysis flavor.
• Although a great deal of effort went into it, with meetings and several academic curricula/programs, it didn’t take off
  – In my experience not a lot of students were interested, and
  – The academic job opportunities were not great
• Data science has more jobs; maybe it will do better?
• Can we usefully link these concepts?
• PS: both use parallel computing!
• In days gone by, I did research in particle physics phenomenology, which in retrospect was an early form of data science using models extensively
16. IU Data Science Program: Masters
• Masters fully approved by University and State October 14, 2014 and started January 2015
• Blended online and residential (any combination)
– Online offered at in-state rates (~$1100 per course)
– Hybrid (online for a year and then residential) surprisingly not popular
• Informatics, Computer Science, Information and Library Science in
School of Informatics and Computing and the Department of
Statistics, College of Arts and Science, IUB
• 30 credits (10 conventional courses)
• Basic (general) Masters degree plus tracks
– Currently only track is “Computational and Analytic Data Science”
– Other tracks expected such as Biomedical Data Science
17. Data Science Enrollment
• Fall 2015: about 240 new applicants to the program; enrollment is capped
• Certificate in Data Science (started January 2014)
– Current 29
– 16 applicants (12 admits, 9 accepts)
• (reduced from last year as prefer Masters)
– Plus 40 students in special executive education certificate through Kelley
Business School – currently just one class
• Online Masters in Data Science (started January 2015)
– Current 54
– 42 applicants; just 2 “hybrid” (28 admits, 19 accepts)
– Transfers from certificate have head start
• Residential Masters in Data Science (started January 2015)
– Current 3
– 187 applicants (101 admits, 55 accepts)
• Expected Data Science total enrollment Fall 2015 150-170
18. Advertising Campaign
• Comparison of “Adwords” results for Three Masters Programs
• Security Informatics
• Data Science
• Information and Library Science
• CPC Cost per Click
• CTR Click Through Rate
Program       Adwords timeframe   Adwords Cost   # clicks   CTR     CPC     # applications
Security      12/1-4/30           $13K           2,577      0.20%   $5.10   26
Data Science  10/31-3/30          $17K           38,544     1.28%   $0.43   267
ILS           9/1-4/30            $18K           4,382      0.11%   $4.08   199
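As a sanity check on the table, CPC is simply spend divided by clicks; a short script using the rounded costs from the table reproduces the listed values to within rounding:

```python
# Recompute cost-per-click from the (rounded to the nearest $1K) figures above.
campaigns = {
    "Security":     (13_000, 2_577),
    "Data Science": (17_000, 38_544),
    "ILS":          (18_000, 4_382),
}

for name, (cost, clicks) in campaigns.items():
    print(f"{name}: CPC ${cost / clicks:.2f}")
```

The Data Science campaign's order-of-magnitude lower CPC, combined with its far higher click-through rate, is what makes its cost per application so much lower than the other two programs'.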
20. 3 Types of Students
• Professionals wanting skills to improve their job, or “required” by their employer to keep up with technology advances
• Traditional sources of IT Masters
• Students in non IT fields wanting to do “domain
specific data science”
21. What do students want?
• Degree with some relevant curriculum
– Data Science and Computer Science are distinct BUT
• An important goal is often “Optional Practical Training” (OPT), which allows graduated students a visa to work for US companies
– Must have spent at least a year in the US in a residential program
• Residential CS Masters (at IU) is 95% foreign students
• Online program students are quite varied but mostly US professionals aiming to improve or switch jobs
22. IU and Competition
• With Computer Science, Informatics, ILS, and Statistics, IU has a particularly broad, unrivalled technology base
– Other universities have more domain data science than IU
• Existing Masters in US in table. Many more certificates and related degrees
(such as business analytics)
School                                                    Program                                  Campus   Online   Degree
Columbia University                                       Data Science                             Yes      No       MS 30 cr
Illinois Institute of Technology                          Data Science                             Yes      No       MS 33 cr
New York University                                       Data Science                             Yes      No       MS 36 cr
University of California Berkeley School of Information   Master of Information and Data Science   Yes      Yes      M.I.D.S.
University of Southern California                         Computer Science with Data Science       Yes      No       MS 27 cr
23. Data Science Curriculum
Faculty in Data Science is “virtual department”
4 course Certificate: purely online, started January 2014
10 course Masters: online/residential, started January 2015
24. Basic Masters Course Requirements
• One course from two of three technology areas
– I. Data analysis and statistics
– II. Data lifecycle (includes “handling of research data”)
– III. Data management and infrastructure
• One course from (big data) application course cluster
• Other courses chosen from list maintained by Data Science Program
curriculum committee (or outside this with permission of advisor/ Curriculum
Committee)
• Capstone project optional
• All students are assigned an advisor who approves course choices.
• Due to variation in preparation, courses are labeled
– Decision Maker
– Technical
• Corresponding to two categories in the McKinsey report – note Decision Maker had an order of magnitude more expected job openings
25. Computational and Analytic Data Science track
• For this track, data science courses have been reorganized into categories reflecting
the topics important for students wanting to prepare for computational and analytic
data science careers for which a strong computer science background is necessary.
Consequently, students in this track must complete additional requirements:
• 1) A student has to take at least 3 courses (9 credits) from Category 1 Core
Courses. Among them, B503 Analysis of Algorithms is required and the student
should take at least 2 courses from the following 3:
– B561 Advanced Database Concepts,
– [STAT] S520 Introduction to Statistics OR (New Course) Probabilistic Reasoning
– B555 Machine Learning OR I590 Applied Machine Learning
• 2) A student must take at least 2 courses from Category 2 Data Systems, AND, at
least 2 courses from Category 3 Data Analysis. Courses taken in Category 1 can
be double counted if they are also listed in Category 2 or Category 3.
• 3) A student must take at least 3 courses from Category 2 Data Systems, OR, at
least 3 courses from Category 3 Data Analysis. Again, courses taken in Category 1
can be double counted if they are also listed in Category 2 or Category 3. One of
these courses must be an application domain course
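The three numbered rules can be written down as a small validity check. This is only a sketch: the category sets below are abbreviated, hypothetical samples, not the official course lists, and the B503-plus-two-of-three detail of rule 1 is simplified to "at least 3 core courses including B503":

```python
# Hypothetical, abbreviated category membership -- not the official lists.
CORE = {"B503", "B561", "S520", "B555"}        # Category 1 (sample)
SYSTEMS = {"B561", "B534", "B649", "P538"}     # Category 2 (sample)
ANALYSIS = {"B555", "B565", "S520", "I590"}    # Category 3 (sample)

def satisfies_track(courses):
    taken = set(courses)
    # Rule 1: at least 3 Core courses, and B503 is required.
    core = taken & CORE
    if "B503" not in core or len(core) < 3:
        return False
    # Rule 2: at least 2 from Data Systems AND at least 2 from Data Analysis
    # (Category 1 courses double-count via plain set intersection).
    if len(taken & SYSTEMS) < 2 or len(taken & ANALYSIS) < 2:
        return False
    # Rule 3: at least 3 from Data Systems OR at least 3 from Data Analysis.
    return len(taken & SYSTEMS) >= 3 or len(taken & ANALYSIS) >= 3

print(satisfies_track(["B503", "B561", "S520", "B534", "B649", "B555", "B565"]))
```

Note how the double-counting provision falls out naturally: a course such as B561 sits in both CORE and SYSTEMS, so one enrollment satisfies both rules.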
26. Admissions Criteria
• Decided by the Data Science Program Curriculum Committee
• Need some computer programming experience (through coursework or work experience); a mathematical background and knowledge of statistics will be useful
• Tracks can impose stronger requirements
• 3.0 Undergraduate GPA
• A 500 word personal statement
• GRE scores are required for all applicants.
• 3 letters of recommendation
27. Geoffrey Fox’s
Online Data Science Classes I
Same class offered as
• MOOC
• Residential class
• Online class for credit
28. Some Online Data Science Classes
• BDAA: Big Data Applications & Analytics
– Used to be called X-Informatics
– ~40 hours of video mainly discussing applications (The X in
X-Informatics or X-Analytics) in context of big data and
clouds https://bigdatacourse.appspot.com/course
• BDOSSP: Big Data Open Source Software and Projects
http://bigdataopensourceprojects.soic.indiana.edu/
– ~27 Hours of video discussing HPC-ABDS and use on
FutureSystems for Big Data software
• Both divided into sections (coherent topics), units (~lectures) and lessons (5-20 minutes), during which the student is meant to stay awake
29. • 1 Unit: Organizational Introduction
• 1 Unit: Motivation: Big Data and the Cloud; Centerpieces of the Future Economy
• 3 Units: Pedagogical Introduction: What is Big Data, Data Analytics and X-Informatics
• SideMOOC: Python for Big Data Applications and Analytics: NumPy, SciPy, MatPlotlib
• SideMOOC: Using FutureSystems for Java and Python
• 4 Units: X-Informatics with X= LHC Analysis and Discovery of Higgs particle
– Integrated Technology: Explore Events; histograms and models; basic statistics (Python and some in Java)
• 3 Units on a Big Data Use Cases Survey
• SideMOOC: Using Plotviz Software for Displaying Point Distributions in 3D
• 3 Units: X-Informatics with X= e-Commerce and Lifestyle
• Technology (Python or Java): Recommender Systems - K-Nearest Neighbors
• Technology: Clustering and heuristic methods
• 1 Unit: Parallel Computing Overview and familiar examples
• 4 Units: Cloud Computing Technology for Big Data Applications & Analytics
• 2 Units: X-Informatics with X = Web Search and Text Mining and their technologies
• Technology for Big Data Applications & Analytics : Kmeans (Python/Java)
• Technology for Big Data Applications & Analytics: MapReduce
• Technology for Big Data Applications & Analytics : Kmeans and MapReduce Parallelism (Python/Java)
• Technology for Big Data Applications & Analytics : PageRank (Python/Java)
• 3 Units: X-Informatics with X = Sports
• 1 Unit: X-Informatics with X = Health
• 1 Unit: X-Informatics with X = Internet of Things & Sensors
• 1 Unit: X-Informatics with X = Radar for Remote Sensing
Big Data Applications & Analytics Topics
Red = Software
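Several of the software topics above (Kmeans, MapReduce, PageRank) are implemented by students in Python or Java. As an illustration of the flavor, here is a minimal pure-Python k-means (Lloyd's algorithm) on made-up 2-D points; it is a sketch in the spirit of the Kmeans technology units, not the actual course code:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k), key=lambda c: (x - centers[c][0]) ** 2 +
                                            (y - centers[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update step: move each center to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return sorted(centers)

# Two obvious blobs, around (0, 0) and (10, 10).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, 2))
```

The assignment step is embarrassingly parallel over points, which is what makes k-means a natural vehicle for the MapReduce parallelism units that follow it in the course.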
33. Course Home Page showing Syllabus
Note that we have a course – section – unit – lesson hierarchy (supported by Mooc Builder) with abstracts available at each level of the hierarchy. The home page has overview information (shown earlier) plus a list of all sections and a syllabus, shown above.
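A minimal sketch of that course – section – unit – lesson hierarchy with per-level abstracts; the names, abstracts, and field layout here are invented for illustration, not Mooc Builder's actual schema:

```python
# Each level of the hierarchy carries an abstract; lessons also carry a length.
course = {
    "abstract": "Big Data Applications & Analytics",
    "sections": [{
        "abstract": "X-Informatics with X = LHC",
        "units": [{
            "abstract": "Discovery of the Higgs particle",
            "lessons": [
                {"abstract": "Explore events", "minutes": 12},
                {"abstract": "Histograms and models", "minutes": 18},
            ],
        }],
    }],
}

def total_minutes(course):
    """Sum lesson lengths over the whole course hierarchy."""
    return sum(lesson["minutes"]
               for section in course["sections"]
               for unit in section["units"]
               for lesson in unit["lessons"])

print(total_minutes(course))
```

Keeping abstracts at every level is what lets the home page render a syllabus by walking just the top layers of this structure.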
34. A typical lesson (the first in unit 21) Note links to all
37 units across the top
35. MOOC Version
• Offered at https://bigdatacourse.appspot.com/preview
• Open to everybody
• Uses no University resources
• Updated December 2014
• One of two SoIC MOOCs named one of “7 great MOOCs for techies”
by ComputerWorld http://www.computerworld.com/article/2849569/7-
great-moocs-for-techies-all-free-starting-soon.html November 2014
• As of May 14, 2015: 3,562 enrolled – small by MOOC standards
• Students from 108 countries
– 1020 USA
– 916 India
– 180 Brazil
– ~130 France, Spain, UK
Student Starting Level
Associate's degree: 73
Bachelor's degree: 1078
Doctorate: 243
High School and equivalent: 176
Master's degree: 1257
Other: 42
(blank): 693
37. Homeworks
• These are online within Google Course Builder for the MOOC, with peer assessment. In the 3-credit offerings, all graded material (homework and projects) is handled traditionally through Indiana University Oncourse (since superseded by Canvas).
• Oncourse was additionally used to assign which videos should be watched each week and the discussion forum topics described later (these were just “special homeworks” in Oncourse).
• In the non-residential data science certificate class, the students were on a variable schedule (typically working full time with many distractions; one, for example, had faculty position interviews) and considerable latitude was given for video and homework completion dates.
38. Discussion Forums
• Each offering had a separate set of electronic discussion forums, which were used for class announcements (replicating Oncourse) and for assigned discussions.
• The following slide illustrates an assigned discussion on the implications of the success of e-commerce for the future of “real malls”. The students were given “participation credit” for posting here, and these discussions were very well received.
• Later offerings made greater use of these forums. Based on student feedback, we encouraged even greater participation through students both posting and commenting.
• Note I personally do not like specialized (walled garden) forums, and the class forums were set up using standard Google Community Groups with a familiar, elegant interface. These community groups also link well to Google Hangouts, described later.
• As well as interesting topics, all class announcements were made in the “Instructor” forum, repeating information posted at Oncourse. Of course no sensitive material such as returned homework was posted on the Google site.
39. The community group for one of classes and
one forum (“No more malls”)
40. Hangouts and Adobe Connect
• For the purely online offering, we supplemented the asynchronous material described above with real-time interactive Google Hangout video sessions.
• Given varied time zones and weekday demands on students, these were held at 1pm Eastern on Sundays.
• Google Hangouts are conveniently scheduled from the community page and offer interactive video and chat capabilities that were well received. Other technologies such as Skype are also possible.
• Hangouts are restricted to 10-15 people, which was sufficient for this section (not all of the 12 students attended a given class) but is in general insufficient.
• The Hangouts focused on general data science issues and the mechanics of the class.
• We augmented the Hangouts with non-video Adobe Connect sessions.
41. Figure 6: Community Events for Online
Data Science Certificate Course
42. In-class Sessions
• The residential sections had regular in-class sessions: one 90-minute session per class each week. This was originally two sessions but was reduced to one, partly because the online videos turned these into “flipped classes” with less need for in-class time, and partly to accommodate more students (77 total graduate and undergraduate) in two groups with separate classes.
• These classes were devoted to discussions of course material, homework and, largely, the discussion forum topics. This part of the course was not greatly liked by the students – especially the undergraduate section, which voted in favor of a model with only the online components (including the discussion forums, which they recommended expanding).
• In particular the 9.30am start time was viewed as too early and intrinsically unattractive.
44. Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose
16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, Azure Machine
Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM
Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana,
Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero,
OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream
Analytics, Floe
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Hama,
Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ,
NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub,
Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages (May 15, 2015)
45. Big Data & Open Source Software Projects Overview
• This course studies DevOps and the software used in many commercial big data activities.
• The backdrop for the course is the ~350 software subsystems of HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) illustrated at http://hpc-abds.org/kaleidoscope/
• We cover the cloud computing architecture underlying ABDS and contrast it with HPC.
• The main activity of the course is building a significant project using multiple HPC-ABDS subsystems combined with user code and data.
• Projects will be suggested, or students can choose their own
• http://cloudmesh.github.io/introduction_to_cloud_computing/class/lesson/projects.html
• For more information, see: http://bigdataopensourceprojects.soic.indiana.edu/ and
• http://cloudmesh.github.io/introduction_to_cloud_computing/class/bdossp_sp15/week_plan.html
• ~25 hours of video – probably too much for a semester class
49. Unexpected Lessons
• We learnt some things from the current offering of the BDOSSP class
– 40 online students from around the world
• The hyperlinking of material caused students NOT to go through the material systematically
– Suggests moving to a structured hierarchy as in the BDAA course
– This followed from the use of Canvas as a mundane LMS plus multiple web resources (Microsoft Office Mix and our computer support pages)
• Students did not use email and discussion groups in Canvas; we switched to emails and list serves to their main (non-IU) email addresses
• Very erratic progress due to different time zones and each student's full-time job interruptions
– Difficult to hold communal “help” sessions and to give interactive support at the time a student wanted
• OpenStack fragile!
51. Background on MOOCs
• MOOCs are a “disruptive force” in the educational environment
– Coursera, Udacity, Khan Academy and many others
• MOOCs have courses and technologies
• Google Course Builder and OpenEdX are open source MOOC technologies
• Blackboard and others are learning management systems with (some) MOOC support
• Coursera, Udacity etc. have internal proprietary MOOC software
• This software is LMS++
• LMS = Learning Management System
52. MOOC Style Implementations
• Courses from commercial sources, universities and partnerships
• Courses with 100,000 students (free)
• Georgia Tech is a leader in rigorous academic curriculum – a MOOC-style Masters in Computer Science (pay tuition, get a regular GT degree)
• An interesting way to package tutorial material for computers and software
– e.g. online programming laboratories supported by MOOC modules on how to use the system
54. MOOCs in the SC Community
• Activities like CI-Tutor and HPC University are community activities that have collected much re-usable education material
• MOOCs naturally support re-use at the lesson or higher level
– e.g. include an MPI-on-XSEDE MOOC as part of many parallel programming classes
• Need to develop agreed ways to use backend servers (HPC or Cloud) to support MOOC laboratories
– Students should be able to take MOOC classes from a tablet or phone
• Parts of MOOCs (Units or Sections) can be used as modules to enhance classes in outreach activities
57. Potpourri of Online Technologies
• Canvas (Indiana University default): best for interfacing with IU grading and records
• Google Course Builder: best for management and integration of components
• Ad hoc web pages: an alternative, easy-to-build integration
• Microsoft Mix: simplest faculty preparation interface
• Adobe Presenter/Camtasia: more powerful video preparation tools that support subtitles, but not clearly needed
• Google Community: good social interaction support
• YouTube: best user interface for videos (without Mix PowerPoint support)
• Hangout: best for instructor-student online interactions (one instructor to 9 students with live feed). Hangout on Air mixes live and streaming (30-second delay from archived YouTube) and allows more participants
• OpenEdX: at one time the future of Google Course Builder; getting easier to use but still significant effort
58. Four Online Platforms I

Feature: CourseBuilder | OpenEdX | IU Canvas | OfficeMix Plugin for PowerPoint
Open source: Yes | Yes | No | N/A
Microsoft integration (Office 365, OneDrive, Azure cloud): No | Yes – predicted to be included in the upcoming release | No | N/A
Analytics: Some analytics included but not comprehensive; still needs more development | No analytics included, but a version 0 alpha Analytics API is available; external apps can be developed | Very basic | Very basic but more useful than Canvas
Peer reviews: Yes | Yes | Yes | No
59. Four Online Platforms II

Feature: CourseBuilder | OpenEdX | IU Canvas | OfficeMix Plugin for PowerPoint
LTI compliance (Learning Technologies Integration): Yes – CB as an LTI provider or consumer | Yes | Yes – functionality might be limited by IU | N/A
Ease of use and customization (1 = not easy, 10 = very easy): 5/10 for students, faculty, developers | 7/10 ease of use by students, faculty; 3/10 customization by developer | N/A | PowerPoint slide-labelled videos could be an advantage
Ease of deployment: 10/10 | 1/10 | N/A | N/A
Cost: Almost none; can rise with increased usage of cloud transactions but usually a very low cost operation | Very expensive to deploy and maintain the servers; needs dedicated staff for administering servers | IU provided | N/A
60. Four Online Platforms III

Feature: CourseBuilder | OpenEdX | IU Canvas | OfficeMix Plugin for PowerPoint
Unique features and functionality: Skill maps; BigQuery for analytics | Good UI for course administration; integrated forums, grading, content area, and much more | Export/import grades | Enables faculty to record their own videos and insert interactive content such as quizzes, a programming test-bed etc.
Common features: Certificate generation supported; quizzes; assessments; peer reviews; autograding | Certificate generation supported; quizzes; assessments; autograding | Quizzes; assessments; peer review | N/A
62. Lessons / Insights
• Data Science is a very healthy area
• At IU, I expect interest to grow, although being set up as a program has strange side effects
• It is not clear whether online education is taking off; the picture may be distorted by US company hiring practices
• I teach all my classes – residential or online – with online lectures
• All of this is straightforward but hard work
• Current open source and proprietary MOOC software is not very satisfactory; it is “easy” to do better
• There is no reason to differentiate MOOC and general LMS software
63. Details of Masters Degree
Computational and Analytic Data
Science track
64. Computational and Analytic Data Science track
• Category 1: Core Courses
• CSCI B503 Analysis of Algorithms
• CSCI B555 Machine Learning OR INFO I590 Applied Machine
Learning
• CSCI B561 Advanced Database Concepts
• STAT S520 Introduction to Statistics OR (New Course) Probabilistic
Reasoning
• Category 2: Data Systems
• CSCI B534 Distributed Systems
• CSCI B561 Advanced Database Concepts
• CSCI B662 Database Systems & Internal Design
• CSCI B649 Cloud Computing
• CSCI B649 Advanced Topics in Privacy
• CSCI P538 Computer Networks
• INFO I533 Systems & Protocol Security & Information Assurance
• ILS Z534: Information Retrieval: Theory and Practice
65. Computational and Analytic Data Science track
• Category 3: Data Analysis
• CSCI B565 Data Mining
• CSCI B555 Machine Learning
• INFO I590 Applied Machine Learning
• INFO I590 Complex Networks and Their Applications
• STAT S520 Introduction to Statistics
• (New Course) Probabilistic Reasoning
• (New Course CSCI) Algorithms for Big Data
• Category 4: Elective Courses
• CSCI B551 Elements of Artificial Intelligence
• CSCI B553 Probabilistic Approaches to Artificial Intelligence
• CSCI B659 Information Theory and Inference
• CSCI B661 Database Theory and Systems Design
• INFO I519 Introduction to Bioinformatics
• INFO I520 Security For Networked Systems
• INFO I529 Machine Learning in Bioinformatics
• INFO I590 Relational Probabilistic Models
• ILS Z637 - Information Visualization
• Every course in 500/600 SOIC related to data that is not in the list
• All courses from STAT that are 600 and above
67. General Track: Areas I and II
• I. Data analysis and statistics: gives students skills to
develop and extend algorithms, statistical approaches, and
visualization techniques for their explorations of large scale
data. Topics include data mining, information retrieval,
statistics, machine learning, and data visualization and will be
examined from the perspective of “big data,” using examples
from the application focus areas described in Section IV.
• II. Data lifecycle: gives students an understanding of the data
lifecycle, from digital birth to long-term preservation. Topics
include data curation, data stewardship, issues related to
retention and reproducibility, the role of the library and data
archives in digital data preservation and scholarly
communication and publication, and the organizational, policy,
and social impacts of big data.
68. General Track: Areas III and IV
• III. Data management and infrastructure: gives students skills to
manage and support big data projects. Data have to be describable,
discoverable, and actionable. In data science, issues of scale come to the
fore, raising challenges of storage and large-scale computation. Topics
in data management include semantics, metadata, cyberinfrastructure
and cloud computing, databases and document stores, and security and
privacy and are relevant to both data science and “big data” data
science.
• IV. Big data application domains: gives students experience with data
analysis and decision making and is designed to equip them with the
ability to derive insights from vast quantities and varieties of data. The
teaching of data science, particularly its analytic aspects, is most
effective when an application area is used as a focus of study. The
degree will allow students to specialize in one or more application areas
which include, but are not limited to Business analytics, Science
informatics, Web science, Social data informatics, Health and
Biomedical informatics.
69. I. Data Analysis and Statistics
• CSCI B503 Analysis of Algorithms
• CSCI B553 Probabilistic Approaches to Artificial Intelligence
• CSCI B652: Computer Models of Symbolic Learning
• CSCI B659 Information Theory and Inference
• CSCI B551: Elements of Artificial Intelligence
• CSCI B555: Machine Learning
• CSCI B565: Data Mining
• INFO I573: Programming for Science Informatics
• INFO I590 Visual Analytics
• INFO I590 Relational Probabilistic Models
• INFO I590 Applied Machine Learning
• ILS Z534: Information Retrieval: Theory and Practice
• ILS Z604: Topics in Library and Information Science: Big Data Analysis for Web and Text
• ILS Z637: Information Visualization
• STAT S520 Intro to Statistics
• STAT S670: Exploratory Data Analysis
• STAT S675: Statistical Learning & High-Dimensional Data Analysis
• (New Course CSCI) Algorithms for Big Data
• (New Course CSCI) Probabilistic Reasoning
• All courses from STAT that are 600 and above
70. II. Data Lifecycle
• INFO I590: Data Provenance
• INFO I590 Complex Systems
• ILS Z604 Scholarly Communication
• ILS Z636: Semantic Web
• ILS Z652: Digital Libraries
• ILS Z604: Data Curation
• (New Course INFO): Social and Organizational Informatics of
Big Data
• (New Course ILS): Project Management for Data Science
• (New Course ILS): Big Data Policy
71. III. Data Management and Infrastructure
• CSCI B534: Distributed Systems
• CSCI B552: Knowledge-Based Artificial Intelligence
• CSCI B561: Advanced Database Concepts
• CSCI B649: Cloud Computing (offered online)
• CSCI B649 Advanced Topics in Privacy
• CSCI B649: Topics in Systems: Cloud Computing for Data Intensive Sciences
• CSCI B661: Database Theory and System Design
• CSCI B662 Database Systems & Internal Design
• CSCI B669: Scientific Data Management and Preservation
• CSCI P536: Operating Systems
• CSCI P538 Computer Networks
• INFO I520 Security For Networked Systems
• INFO I525: Organizational Informatics and Economics of Security
• INFO I590 Complex Networks and their Applications
• INFO I590: Topics in Informatics: Data Management for Big Data
• INFO I590: Topics in Informatics: Big Data Open Source Software and Projects
• ILS S511: Database
• Every course in 500/600 SOIC related to data that is not in the list
72. IV. Application areas
• CSCI B656: Web mining
• CSCI B679: Topics in Scientific Computing: High Performance Computing
• INFO I519 Introduction to Bioinformatics
• INFO I529 Machine Learning in Bioinformatics
• INFO I533 Systems & Protocol Security & Information Assurance
• INFO I590: Topics in Informatics: Big Data Applications and Analytics
• INFO I590: Topics in Informatics: Big Data in Drug Discovery, Health and
Translational Medicine
• ILS Z605: Internship in Data Science
• Kelley School of Business: business analytics course(s)
• Other courses from Indiana University e.g. Physics Data Analysis
74. Technical Track of General DS Masters
• Year 1 Semester 1:
– INFO 590: Topics in Informatics: Big Data Applications and Analytics
– ILS Z604: Big Data Analytics for Web and Text
– STAT S520: Intro to Statistics
• Year 1: Semester 2:
– CSCI B661: Database Theory and System Design
– ILS Z637: Information Visualization
– STAT S670: Exploratory Data Analysis
• Year 1: Summer:
– CSCI B679: Topics in Scientific Computing: High Performance
Computing
• Year 2: Semester 3:
– CSCI B555: Machine Learning
– STAT S670: Exploratory Data Analysis
– CSCI B649: Cloud Computing
75. Computational and Analytic Data Science track
• Year 1 Semester 1:
– B503 Analysis of Algorithms
– B561 Advanced Database Concepts
– S520 Introduction to Statistics
• Year 1: Semester 2:
– B649 Cloud Computing
– Z534: Information Retrieval: Theory and Practice
– B555 Machine Learning
• Year 1: Summer:
– ILS 605: Internship in Data Science
• Year 2: Semester 3:
– B565 Data Mining
– I520 Security For Networked Systems
– Z637: Information Visualization
76. An Information-oriented Track
• Year 1 Semester 1:
– INFO 590: Topics in Informatics: Big Data Applications and Analytics
– ILS Z604 Big Data Analytics for Web and Text.
– STAT S520 Intro to Statistics
• Year 1: Semester 2:
– CSCI B661 Database Theory and System Design
– ILS Z637: Information Visualization
– ILS Z653: Semantic Web
• Year 1: Summer:
– ILS 605: Internship in Data Science
• Year 2: Semester 3:
– ILS Z604 Data Curation
– ILS Z604 Scholarly Communication
– INFO I590: Data Provenance