Many fields are now dominated by Artificial Intelligence (AI). Technology companies such as Amazon, Google, Facebook, and Microsoft are AI-first organizations.
Engineering achievement today is often highlighted by the AI embedded in a vehicle or machine. Industry (Manufacturing) 4.0 focuses on the AI-driven future of the Industrial Internet of Things.
Software is eating the world.
We can describe much computer-systems work as designing, building, and using a global AI and modelling supercomputer that is itself autonomously tuned by AI. We suggest that this is not just a collection of buzzwords but has profound significance, and we examine its consequences for education and research.
Naively, high-performance computing should be relevant to this AI supercomputer, yet the corporate juggernaut makes surprisingly little use of it. We discuss how to change this.
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17 - Mark Goldstein
“Big Data for IoT: Analytics from Descriptive to Predictive to Prescriptive” was presented to the Phoenix Data Conference on 11/4/17 at Grand Canyon University.
As the Internet of Things (IoT) floods data lakes and fills data oceans with sensor and real-world data, analytic tools and real-time responsiveness will require improved platforms and applications to deal with the data flow and move from descriptive to predictive to prescriptive analysis and outcomes.
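The move from descriptive to predictive to prescriptive analytics can be illustrated with a minimal sketch over a stream of sensor readings. Everything here (the threshold, the action names, the naive extrapolation) is invented for illustration and is not from the talk:

```python
# Sketch of the descriptive -> predictive -> prescriptive progression
# for IoT sensor readings. All thresholds and actions are illustrative.

def descriptive(readings):
    """What happened: summarize the observed data."""
    return {"mean": sum(readings) / len(readings),
            "max": max(readings), "min": min(readings)}

def predictive(readings):
    """What will happen: naive linear extrapolation from the last two points."""
    return readings[-1] + (readings[-1] - readings[-2])

def prescriptive(readings, limit=80.0):
    """What should we do: recommend an action based on the prediction."""
    forecast = predictive(readings)
    return "throttle device" if forecast > limit else "no action"

temps = [70.0, 72.5, 75.0, 78.0]
print(descriptive(temps))   # summary statistics of the window
print(predictive(temps))    # 81.0
print(prescriptive(temps))  # "throttle device", since 81.0 exceeds the limit
```

Real platforms replace the toy extrapolation with trained models and the threshold rule with an optimizer, but the three-stage shape is the same.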
Short introduction to Big Data Analytics, the Internet of Things, and their s... - Andrei Khurshudov
Invited talk at the 26th ASME Annual Conference on Information Storage and Processing Systems (ISPS 2017), held at the Hilton San Francisco District, San Francisco, California, USA, August 29–30, 2017.
SIM AZ: Emerging Information Technology Innovations & Trends 11/15/17 - Mark Goldstein
Mark Goldstein of the International Research Center presented a broad overview of Emerging Information Technology Innovations & Trends to the Society for Information Management Arizona Chapter (SIM AZ) on 11/15/17, showcasing the latest emerging technologies and novel tech innovations and highlighting the market and societal transformations underway or anticipated. It covered: Advances in Computer Power and Pervasiveness; Internet of Things (IoT) Overview and Ecosystem; Mobility, Augmented Reality and Virtual Reality (AR/VR); Medical Advances Through Informatics; Artificial Intelligence (AI) and Robotics; Big Data, Its Applications and Implications; and Onward into the Future.
Green Compute and Storage - Why does it Matter and What is in Scope - Narayanan Subramaniam
Presentation made for BITS students under the auspices of IEEE Goa on the occasion of Lumini '21, BITS Goa's annual technical symposium. The talk gives an overview of why green compute/storage matters as the Internet explodes with voice, video, and other content, consuming 8% (3 TWh) of total global electricity production and rising exponentially to 21% (9 TWh) by 2030, a trend likely to accelerate with the advent of 5G and IoT everywhere. It explores three key pillars of computing with respect to "green" and the consequences that need to be mitigated in short order.
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp... - Geoffrey Fox
A motivating introduction to a MOOC on Big Data from an applications point of view: https://bigdatacoursespring2014.appspot.com/course
Course says:
Geoffrey motivates the study of X-informatics by describing data science and clouds. He starts with striking examples of the data deluge drawn from research, business, and the consumer world. The growing number of jobs in data science is highlighted, and he describes industry trends in both clouds and big data.
He introduces the cloud computing model developed at amazing speed by industry. The four paradigms of scientific research are described, with the growing importance of the data-oriented fourth paradigm. He covers three major X-informatics areas: physics, e-commerce, and web search, followed by a broad discussion of cloud applications. Parallel computing in general, and particular features of MapReduce, are described. He comments on data science education and the benefits of using MOOCs.
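The MapReduce pattern mentioned above can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases on a word count, not the distributed Hadoop implementation:

```python
from collections import defaultdict

def map_phase(docs):
    # map: emit (word, 1) pairs from each document
    for doc in docs:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clouds", "clouds and data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'clouds': 2, 'and': 1}
```

In a real cluster the shuffle moves data over the network between map and reduce workers; the program structure the user writes is essentially the two functions above.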
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ... - Edward Curry
Cyber-Physical Energy Systems (CPES) exploit the potential of information technology to boost energy efficiency while minimising environmental impacts. CPES can help manage energy more efficiently by providing a functional view of the entire energy system so that energy activities can be understood, changed, and reinvented to better support sustainable practices. CPES can be applied at different scales, from Smart Grids and Smart Cities to Smart Enterprises and Smart Buildings. Significant technical challenges exist in terms of information management, leveraging real-time sensor data, and coordinating the various stakeholders to optimize energy usage.
In this talk I describe an approach to overcoming these challenges by reusing Web standards to quickly connect the required systems within a CPES. The resulting lightweight architecture leverages Web technologies including Linked Data, the Web of Things, and social media. The paper describes the fundamentals of the approach and demonstrates it in an Enterprise Energy Management scenario for a smart building.
Linked Water Data For Water Information Management - Edward Curry
The management of water consumption is hindered by low general awareness and the absence of precise historical and contextual information. Effective and efficient management of water resources requires a holistic approach considering all the stages of water usage. A decision-support tool for water management services requires access to a number of different data domains and data providers. The design of next-generation water information management systems poses significant technical challenges in terms of information management, integration of heterogeneous data, and real-time processing of dynamic data. Linked Data is a set of web technologies that enables the integration of different data sources. This work investigates the usage of Linked Data technologies in the water management domain, describes the fundamental concepts of the approach, details an architecture, and discusses possible water management applications.
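Linked Data represents facts as subject-predicate-object triples so that independent sources can be merged and queried uniformly. A minimal sketch, using a plain Python set as a stand-in for a real RDF store and with invented URIs rather than any actual water-domain vocabulary:

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# The "meter:", "water:", "building:" URIs below are invented examples.
triples = {
    ("meter:42", "rdf:type", "water:Meter"),
    ("meter:42", "water:locatedIn", "building:A"),
    ("building:A", "rdf:type", "geo:Building"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the given pattern (None acts as a wildcard)."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Which facts do we know about meter:42?
print(query(s="meter:42"))
# Which subjects are buildings?
print(query(p="rdf:type", o="geo:Building"))
```

Because facts from different providers are just more triples with shared URIs, integration reduces to taking the union of the stores, which is the core appeal for heterogeneous domains like water management.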
Big Data and Big Data Management (BDM) with Current Technologies – Review - IJERA Editor
The emerging phenomenon called "Big Data" is pushing numerous changes in businesses and many other organizations, domains, and fields, many of which are struggling just to manage the massive data sets. Big data management is about two things, "big data" and "data management", and the two work together to achieve business and technology goals. In the past few years data generation has increased tremendously due to the digitization of data, and new computer tools and technologies for transmitting data among computers over the Internet appear every day. Big data's relevance and importance for decision making and performance improvement have grown very quickly across all areas. Big data management also faces numerous challenges; common complexities include low organizational maturity relative to big data, weak business support, and the need to learn new technology approaches. This paper discusses the impacts of Big Data and the issues related to data management using current technologies.
Computer Science is an ever-changing field with new inventions each day. Here are the latest trends in the field of computer science which are making their mark in this era of digitization.
Source: http://www.techsparks.co.in
An overview session on Grid Computing conducted in an AICTE-approved STTP on Virtualization, Cloud Computing and Big Data at Vidyalankar Institute of Technology, Mumbai, between December 9 and 20, 2013. About 53 participants from various colleges across the state attended. Courtesy: consolidated from Internet sources.
Streamline Data Governance with Egeria: The Industry's First Open Metadata St... - DataWorks Summit
Learn about the industry's new open metadata standard Egeria, introduced in September by ODPi, The Linux Foundation's Open Data Platform initiative. Egeria supports the free flow of standardized metadata between different technologies and vendor platforms, enabling organizations to locate, manage, and use their data resources more effectively. Explore how Egeria's set of open APIs, types, and interchange protocols allows all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks for automating the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
This presentation by ODPi Director John Mertic provides an introduction to Egeria and explores how the standard provides a vendor-neutral approach to data governance. Learn how a group of companies led by ING, IBM, and Hortonworks came together through the open source community to re-imagine data governance and deliver Egeria, automating the collection, management, and use of metadata across organizations of any size and complexity. Learn how Egeria was built on open standards and delivered under the Apache 2.0 open source license.
General introduction to Big Data terms and technologies: Velocity, Volume, Variety (3V) and Veracity (4V), NoSQL, Data Science, main data stores (key-value, column, document, graph), Elasticsearch, ...
Presentation of data.be products leveraging Big Data & Elasticsearch
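The four main NoSQL store families listed above (key-value, column, document, graph) differ chiefly in how they shape a record. A rough sketch of the same entity in each model, using pure Python structures as stand-ins rather than any real database API:

```python
# Key-value: an opaque value looked up by key; the store knows nothing
# about the value's internal structure.
kv = {"user:1": b"...serialized blob..."}

# Column(-family): each row holds named columns, sparsely populated.
column = {"user:1": {"name": "Ada", "city": "Paris"}}

# Document: a nested, self-describing (JSON-like) structure per record.
document = {"_id": "user:1",
            "name": "Ada",
            "orders": [{"sku": "X1", "qty": 2}]}

# Graph: nodes plus explicit, typed relationships between them.
nodes = {"user:1": {"name": "Ada"}, "prod:X1": {"sku": "X1"}}
edges = [("user:1", "BOUGHT", "prod:X1")]

# A graph query is a traversal over the edges:
bought = [dst for src, rel, dst in edges
          if src == "user:1" and rel == "BOUGHT"]
print(bought)  # ['prod:X1']
```

The choice between them is largely a choice of which queries should be cheap: point lookups (key-value), wide sparse rows (column), nested reads and writes of one entity (document), or relationship traversals (graph).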
Cloud computing: concepts, technologies, and mechanisms for tackling problems in the cloud.
Please focus on the problem-oriented points rather than on who created the material.
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,... - Mihai Criveti
Automate your Data Science pipeline with Ansible, Python and Kubernetes - ODSC Talk
What is Data Science and the Data Science Landscape
Process and Flow
Understanding Data
The Data Science Toolkit
The Big Data Challenge
Cloud Computing Solutions
The rise of DevOps in Data Science
Automate your data pipeline with Ansible
Introduction to Cloud computing and Big Data - Hadoop - Nagarjuna D.N
Cloud Computing Evolution
Why is Cloud Computing needed?
Cloud Computing Models
Cloud Solutions
Cloud job opportunities
Criteria for Big Data
Big Data challenges
Technologies to process Big Data: Hadoop
Hadoop History and Architecture
Hadoop Eco-System
Hadoop Real-time Use cases
Hadoop job opportunities
Hadoop and SAP HANA integration
Summary
Big Data HPC Convergence and a bunch of other things - Geoffrey Fox
This talk supports the Ph.D. in Computational & Data-Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomenon, jobs, and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
Internet of Things Presentation to Los Angeles CTO Forum - Fred Thiel
What are the impacts to our systems and businesses when billions of devices start sharing data? This presentation covers some important statistics about the implications of the coming IoT wave and how it will disrupt those who are not prepared.
Content1. Introduction2. What is Big Data3. Characte.docx - dickonsondorris
Content
1. Introduction
2. What is Big Data
3. Characteristics of Big Data
4. Storing, selecting and processing of Big Data
5. Why Big Data
6. How it is Different
7. Big Data sources
8. Tools used in Big Data
9. Applications of Big Data
10. Risks of Big Data
11. Benefits of Big Data
12. How Big Data Impacts IT
13. Future of Big Data
Introduction
• Big Data may well be the Next Big Thing in the IT world.
• Big data burst upon the scene in the first decade of the 21st century.
• The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
• Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings.
• 'Big Data' is similar to 'small data', but bigger in size.
• Having bigger data, however, requires different approaches: techniques, tools, and architecture.
• The aim is to solve new problems, or old problems in a better way.
• Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
What is BIG DATA?
• Walmart handles more than 1 million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.
• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
Three Characteristics of Big Data: the 3Vs
• Volume: data quantity
• Velocity: data speed
• Variety: data types
1st Characteristic of Big Data: Volume
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smartphones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
2nd Characteristic of Big Data: Velocity
• Clickstreams and ad impressions capture user behavior at millions of events per second.
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
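Handling velocity usually means computing over a bounded window of recent events rather than storing everything. A minimal sketch of an events-per-window counter; the window size and the event timestamps are invented for illustration:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events that occurred within the last `window` seconds."""
    def __init__(self, window=1.0):
        self.window = window
        self.events = deque()  # timestamps, oldest first

    def record(self, t):
        self.events.append(t)
        # evict timestamps that have fallen out of the window
        while self.events and self.events[0] <= t - self.window:
            self.events.popleft()

    def rate(self):
        """Events seen in the current window."""
        return len(self.events)

c = SlidingWindowCounter(window=1.0)
for t in [0.1, 0.2, 0.9, 1.05, 1.15]:
    c.record(t)
print(c.rate())  # 4: the event at t=0.1 has aged out of the 1-second window
```

Stream processors such as Storm or Flink generalize exactly this idea: per-key windows, distributed across workers, with memory bounded by the window rather than by the stream's total volume.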
3rd Characteristic of Big Data: Variety
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes... - Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing the current capabilities of Apache Hadoop, Spark, Flink, and Heron as well as MPI and Asynchronous Many-Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function-as-a-Service architecture. Note that this "new grid" is focused on data and IoT, not computing. It uses interoperable common abstractions with multiple polymorphic implementations.
High Performance Computing and Big Data - Geoffrey Fox
We propose a hybrid software stack with Large scale data systems for both research and commercial applications running on the commodity (Apache) Big Data Stack (ABDS) using High Performance Computing (HPC) enhancements typically to improve performance. We give several examples taken from bio and financial informatics.
We look in detail at parallel and distributed run-times including MPI from HPC and Apache Storm, Heron, Spark and Flink from ABDS stressing that one needs to distinguish the different needs of parallel (tightly coupled) and distributed (loosely coupled) systems.
We also study "Java Grande", the principles that allow Java codes to perform as fast as those written in more traditional HPC languages. We also note the differences between capability computing (individual jobs using many nodes) and capacity computing (lots of independent jobs).
We discuss how this HPC-ABDS concept allows one to discuss convergence of Big Data, Big Simulation, Cloud and HPC Systems. See http://hpc-abds.org/kaleidoscope/
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC... - Geoffrey Fox
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper identifies a class of machine learning applications with significant computation and communication as a yardstick and presents five optimizations to yield high performance in Java big data analytics. Also, it incorporates these optimizations in developing SPIDAL Java - a highly optimized suite of Global Machine Learning (GML) applications. The optimizations include intra-node messaging through memory maps over network calls, improving cache utilization, reliance on processes over threads, zero garbage collection, and employing offheap buffers to load and communicate data. SPIDAL Java demonstrates significant performance gains and scalability with these techniques when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
http://dsc.soic.indiana.edu/publications/hpc2016-spidal-high-performance-submit-18-public.pdf
http://dsc.soic.indiana.edu/presentations/SPIDALJava.pptx
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
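One of the optimizations listed in the SPIDAL Java abstract, intra-node messaging through memory maps rather than network calls, can be illustrated in miniature. This is a Python sketch of the general idea under simplified assumptions (both "ranks" live in one process here), not the SPIDAL Java implementation:

```python
import mmap
import os
import struct
import tempfile

# Processes on the same node can exchange data through a shared
# memory-mapped file, avoiding loopback-socket overhead.
path = os.path.join(tempfile.mkdtemp(), "msgbuf")
with open(path, "wb") as f:
    f.write(b"\x00" * 16)  # reserve 16 bytes of shared buffer

with open(path, "r+b") as f:
    buf = mmap.mmap(f.fileno(), 16)

    # "rank 0" writes a double into the shared buffer
    buf[0:8] = struct.pack("d", 3.14)

    # "rank 1" reads it back directly from memory; no socket is involved
    value = struct.unpack("d", buf[0:8])[0]
    print(value)  # 3.14
    buf.close()
```

In a real collective, ranks on one node write partial results into agreed-upon offsets of such a buffer and only one rank per node talks over the network, which is where the measured gains come from.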
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views: Problem Architecture (macro pattern); Execution Features (micro patterns); Data Source and Style; and finally the Processing (runtime) view.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
DTW: 2015 Data Teaching Workshop – 2nd IEEE STC CC and RDA Workshop on Curricula and Teaching Methods in Cloud Computing, Big Data, and Data Science
as part of CloudCom 2015 (http://2015.cloudcom.org/), Vancouver, Nov 30-Dec 3, 2015.
Discusses the Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics; the other is BDOSSP: Big Data Open Source Software and Projects. Links are
http://openedx.scholargrid.org/ BDAA Fall 2015
http://datascience.scholargrid.org/ BDOSSP Spring 2016
http://bigdataopensourceprojects.soic.indiana.edu/ Spring 2015
High Performance Processing of Streaming Data - Geoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack: SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) is studied and improved as an example of the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C... - Geoffrey Fox
Describes relations between Big Data and Big Simulation applications and how these can guide a Big Data - Exascale (Big Simulation) convergence (as in the National Strategic Computing Initiative) and lead to a "complete" set of benchmarks. The basic idea is to view use cases as "Data" + "Model".
Visualizing and Clustering Life Science Applications in Parallel - Geoffrey Fox
HiCOMB 2015: 14th IEEE International Workshop on High Performance Computational Biology at IPDPS 2015, Hyderabad, India. This talk covers parallel data analytics for bioinformatics. The messages are:
• Always run MDS; it gives insight into the data and into the performance of machine learning.
• It leads to a data browser, as GIS does for spatial data.
• 3D is better than 2D.
• ~20D better than MSA?
Clustering observations:
• Do you care about quality, or are you just cutting up space into parts?
• Deterministic clustering is always more robust.
• Continuous clustering enables hierarchy.
• Trimmed clustering cuts off tails.
• There are distinct O(N) and O(N²) algorithms.
• Use conjugate gradient.
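The advice to "always run MDS" can be made concrete with classical MDS (Torgerson scaling) in a few lines of NumPy. This is the textbook eigendecomposition algorithm for small data, not the parallel, conjugate-gradient-based implementation the talk describes:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed points in `dim` dimensions from an n x n matrix D of
    pairwise Euclidean distances (classical/Torgerson MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]       # keep the largest `dim`
    scale = np.sqrt(np.maximum(vals[idx], 0))
    return vecs[:, idx] * scale              # n x dim embedding

# Four points on a unit square; a 2D embedding recovers their geometry.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, dim=2)

# Pairwise distances of the embedding match the input distances.
print(np.allclose(np.linalg.norm(X[:, None] - X[None, :], axis=-1), D))
```

For genuinely Euclidean input the embedding is exact up to rotation and reflection; for large or non-Euclidean data one switches to the iterative stress-minimization formulations the talk is about.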
Lessons from Data Science Program at Indiana University: Curriculum, Students... - Geoffrey Fox
Invited talk at the NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar) at IPDPS 2015, May 25, 2015, Hyderabad.
Discusses the Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics https://bigdatacourse.appspot.com/course. The other is BDOSSP: Big Data Open Source Software and Projects http://bigdataopensourceprojects.soic.indiana.edu/
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data... - Geoffrey Fox
Advances in high-performance/parallel computing in the 1980s and 90s were spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that benchmarks should be driven by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate.
Experience with Online Teaching with Open Source MOOC Technology - Geoffrey Fox
This memo describes experiences with online teaching in Spring Semester 2014. We discuss the technologies used and the approach to teaching/learning.
This work is based on Google Course Builder for a Big Data overview course
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Big Data and Clouds: Research and EducationGeoffrey Fox
Presentation September 9 2013 PPAM 2013 Warsaw
Economic Imperative: There are a lot of data and a lot of jobs
Computing Model: Industry adopted clouds which are attractive for data analytics. HPC also useful in some cases
Progress in scalable robust Algorithms: new data need different algorithms than before
Progress in Data Intensive Programming Models
Progress in Data Science Education: opportunities at universities
Comparing Big Data and Simulation Applications and Implications for Software ...Geoffrey Fox
At eScience in the Cloud 2014, Redmond WA, April 30 2014
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However the same is not so true for data-intensive problems, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC Clusters are presented
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However the same is not so true for data-intensive problems, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high-performance Java (Grande) runtime that supports both simulations and big data.
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
Keynote at Sixth International Workshop on Cloud Data Management CloudDB 2014 Chicago March 31 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and very importantly data source.
We then propose that in many cases it is wise to combine the well-known commodity best-practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC Apache integration is particularly important: File systems, Cluster resource management, File and object data management, Inter process and thread communication, Analytics libraries, Workflow and Monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
This proposes an integration of HPC and Apache technologies. HPC-ABDS integration areas include:
File systems,
Cluster resource management,
File and object data management,
Inter process and thread communication,
Analytics libraries,
Workflow
Monitoring
AI-Driven Science and Engineering with the Global AI and Modeling Supercomputer GAIMSC
1. AI-Driven Science and Engineering with the Global AI and Modeling Supercomputer GAIMSC
Workshop on Clusters, Clouds, and Data for Scientific Computing CCDSC 2018
Châteauform’, La Maison des Contes, 427 Chemin de Chanzé, (near Lyon) France
September 4-7 2018
http://www.netlib.org/utk/people/JackDongarra/CCDSC-2018/
Geoffrey Fox, September 5, 2018
Digital Science Center
Department of Intelligent Systems Engineering
Indiana University Bloomington
gcf@indiana.edu, http://www.dsc.soic.indiana.edu/, http://spidal.org/
2. Let’s learn from Microsoft Research what they think are hot areas
• Industry’s role in research is much larger today than 20-40 years ago
• Microsoft Research has about 1000 researchers and has 800 interns per year
• One of the largest computer science research organizations (INRIA larger)
• They just held a faculty summit in August 2018 largely focused on systems for AI
• https://www.microsoft.com/en-us/research/event/faculty-summit-2018/
• With an interesting overview at the end positioning their work as designing, building and using the "Global AI Supercomputer" concept linking the Intelligent Cloud to the Intelligent Edge
https://www.youtube.com/watch?v=jsv7EWhCqIQ&feature=youtu.be
5. Adding Modeling?
• Microsoft is very optimistic and excited
• I added “Modeling” to get the Global AI and Modeling Supercomputer GAIMSC
• Modeling was meant to
• Include classic simulation-oriented supercomputers
• Even just in Big Data, one needs to build a model for the machine learning to use
• Note many talks discussed autotuning, i.e. using GAIMSC to optimize GAIMSC
6. Possible Slogans for Research in the Global AI and Modeling Supercomputer arena
• AI First Science and Engineering
• Global AI and Modeling Supercomputer (Grid)
• Linking Intelligent Cloud to Intelligent Edge
• Linking Digital Twins to Deep Learning
• High-Performance Big-Data Computing
• Big Data and Extreme-scale Computing (BDEC)
• Common Digital Continuum Platform for Big Data and Extreme Scale Computing (BDEC2)
• Using High Performance Computing ideas/technologies to give higher functionality and performance “cloud” and “edge” systems
• Software 2.0 – replace Python by Training Data
• Industry 4.0 – Software-defined machines or the Industrial Internet of Things
7. Big Data and Extreme-scale Computing
http://www.exascale.org/bdec/
• BDEC Pathways to Convergence Report
• New series BDEC2 “Common Digital Continuum Platform for Big Data and Extreme Scale Computing” with first meeting November 28-30, 2018, Bloomington, Indiana, USA.
• First day is an evening reception with meeting focus “Defining application requirements for a Common Digital Continuum Platform for Big Data and Extreme Scale Computing”
• Next meetings: February 19-21, Kobe, Japan (National infrastructure visions), followed by two in Europe, one in the USA and one in China.
http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/bdec2017pathways.pdf
8. AI First Research for AI-Driven Science and Engineering
• Artificial Intelligence is a dominant disruptive technology affecting all our activities including business, education, research, and society.
• Further, several companies have proposed AI First strategies.
• They used to be mobile first
• The AI disruption is typically associated with big data coming from the edge, repositories or sophisticated scientific instruments such as telescopes, light sources and gene sequencers.
• AI First requires mammoth computing resources such as clouds, supercomputers, hyperscale systems and their distributed integration.
• The AI First strategy is using the Global AI and Modeling Supercomputer GAIMSC
• Indiana University is examining a Master’s in AI-Driven Engineering
9. AI First Publicity: 2017 Headlines
• The Race For AI: Google, Twitter, Intel, Apple In A Rush To Grab Artificial Intelligence Startups
• Google, Facebook, And Microsoft Are Remaking Themselves Around AI
• Google: The Full Stack AI Company
• Bezos Says Artificial Intelligence to Fuel Amazon's Success
• Microsoft CEO says artificial intelligence is the 'ultimate breakthrough'
• Tesla’s New AI Guru Could Help Its Cars Teach Themselves
• Netflix Is Using AI to Conquer the World... and Bandwidth Issues
• How Google Is Remaking Itself As A “Machine Learning First” Company
• If You Love Machine Learning, You Should Check Out General Electric
10. Requirements for the Global AI (and Modeling) Supercomputer
• Application Requirements: The structure of an application clearly impacts the needed hardware and software
• Pleasingly parallel
• Workflow
• Global Machine Learning
• Data model: SQL, NoSQL; File Systems, Object store; Lustre, HDFS
• Distributed data from distributed sensors and instruments (Internet of Things) requires an Edge computing model
• Device – Fog – Cloud model and streaming data software and algorithms
• Hardware: node (accelerators such as GPU or KNL for deep learning) and multi-node architecture configured as an AI First HPC Cloud
• Disk speed and location
• Software requirements: Programming model for GAIMSC
• Analytics
• Data management
• Streaming or Repository access or both
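The application structures listed above differ mainly in how much coordination they require. A minimal, hypothetical sketch (plain Python, not tied to any particular runtime) contrasting the two extremes:

```python
# Pleasingly parallel: every item is independent, so the work is a pure
# map with no ordering or communication constraints -- the easiest case
# for any cloud or HPC scheduler.
def pleasingly_parallel(items, f):
    return [f(x) for x in items]

# Global machine learning: each data partition computes a local update,
# but an iteration can only finish after a global reduction combines
# them -- this synchronization point is what demands HPC-style
# interconnects and collectives.
def global_ml_step(partitions, local_update):
    locals_ = [local_update(p) for p in partitions]  # parallel part
    return sum(locals_)                              # global barrier / reduce

print(pleasingly_parallel([1, 2, 3], lambda x: x * x))  # [1, 4, 9]
print(global_ml_step([[1, 2], [3, 4]], sum))            # 10
```

Workflow sits between these two: independent stages chained by data dependencies rather than a per-iteration global barrier.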
11. Distinctive Features of Applications
• Ratio of data to model sizes: vertical axis on next slide
• Importance of Synchronization – ratio of inter-node communication to node computing: horizontal axis on next slide
• Sparsity of Data or Model; impacts the value of GPUs or vector computing
• Irregularity of Data or Model
• Geographic distribution of Data as in edge computing; use of streaming (dynamic data) versus batch paradigms
• Dynamic model structure as in some iterative algorithms
12. Big Data and Simulation Difficulty in Parallelism
[Figure: applications placed on two axes -- horizontal: size of synchronization constraints, vertical: size of disk I/O. From loosely to tightly coupled: Pleasingly Parallel (often independent events, e.g. parameter sweep simulations); MapReduce as in scalable databases, the current major Big Data category, run on commodity clouds; Global Machine Learning, e.g. parallel clustering, and Deep Learning on HPC clouds/supercomputers with accelerators and high-performance interconnect (memory access also critical); Unstructured Adaptive Sparse problems such as graph analytics, e.g. subgraph mining and LDA; and Structured Adaptive Sparse, loosely coupled through to the largest-scale, tightly coupled simulations on exascale supercomputers, with linear algebra at the core (often not sparse). These are just two problem characteristics; there is also the data/compute distribution seen in grid/edge computing.]
13. Remembering Grid Computing: IoT and Distributed Center I
• Hyperscale data centers will grow from 338 in number at the end of 2016 to 628 by 2021. They will represent 53 percent of all installed data center servers by 2021.
• They form a distributed Compute (on data) grid with some 50 million servers
• 94 percent of workloads and compute instances will be processed by cloud data centers by 2021 -- only six percent will be processed by traditional data centers.
• Analysis from Cisco https://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.html
[Charts: projected growth in the number of cloud data centers, the number of instances per server, and the number of public or private cloud data center instances.]
14. Remembering Grid Computing: IoT and Distributed Center II
• By 2021, Cisco expects IoT connections to reach 13.7 billion, up from 5.8 billion in 2016, according to its Global Cloud Index.
• Globally, the data stored in data centers will nearly quintuple by 2021 to reach 1.3 ZB, up 4.6-fold (a CAGR of 36 percent) from 286 exabytes (EB) in 2016.
• Big data will reach 403 EB by 2021, up almost eight-fold from 25 EB in 2016. Big data will represent 30 percent of data stored in data centers by 2021, up from 18 percent in 2016.
• The amount of data stored on devices will be 4.5 times higher than data stored in data centers, at 5.9 ZB by 2021.
• Driven largely by IoT, the total amount of data created (and not necessarily stored) by any device will reach 847 ZB per year by 2021, up from 218 ZB per year in 2016.
• The Intelligent Edge or IoT is a distributed Data Grid
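The stored-data growth figures above are mutually consistent; a quick arithmetic check of the quoted 36 percent CAGR (input values from the slide, computation mine):

```python
# Cisco figures quoted above: 286 EB stored in data centers in 2016,
# growing at a CAGR of ~36 percent over the five years to 2021.
start_eb = 286.0
cagr = 0.36
years = 5

projected_eb = start_eb * (1 + cagr) ** years
growth_factor = projected_eb / start_eb

print(round(projected_eb))       # ~1331 EB, i.e. ~1.3 ZB as cited
print(round(growth_factor, 1))   # ~4.7-fold, close to the quoted 4.6-fold
```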
16. Overall Global AI and Modeling Supercomputer GAIMSC Architecture
• There is only a cloud at the logical center, but it’s physically distributed and owned by a few major players
• There is a very distributed set of devices surrounded by local Fog computing; this forms the logically and physically distributed edge
• The edge is structured and largely data
• These are two differences from the Grid of the past
• e.g. a self-driving car will have its own fog and will not share fog with the truck that it is about to collide with
• The cloud and edge will both be very heterogeneous with varying accelerators, memory size and disk structure.
17. It is challenging these days to compete with Berkeley, Stanford, Google, Facebook, Amazon, Microsoft, IBM.
Zaharia from Stanford (earlier Berkeley and Spark)
18. Collaborating on the Global AI and Modeling Supercomputer GAIMSC
• Microsoft says:
• We can only “play together” and link functionalities from Google, Amazon, Facebook, Microsoft, and Academia if we have open APIs and open code to customize
• We must collaborate
• Open source Apache software
• Academia needs to use and define their own Apache projects
• We want to use the AI supercomputer for AI-Driven science studying the early universe and the Higgs boson, and not just for producing annoying advertisements (the goal of most elite CS researchers)
19. [Figure from “Hidden Technical Debt in Machine Learning Systems”, NIPS 2015, http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf: the ML code is only a small box surrounded by much larger supporting infrastructure.]
Should we train data scientists or data engineers?
Gartner says that there are three times as many jobs for data engineers as data scientists.
There is more to life (jobs) than Machine Learning!
20. Gartner on Data Engineering
• Gartner says that job numbers in data science teams are
• 10% - Data Scientists
• 20% - Citizen Data Scientists (“decision makers”)
• 30% - Data Engineers
• 20% - Business experts
• 15% - Software engineers
• 5% - Quant geeks
• ~0% - Unicorns (very few exist!)
21. Ways of adding High Performance to the Global AI (and Modeling) Supercomputer
• Fix performance issues in Spark, Heron, Hadoop, Flink etc.
• Messy, as some features of these big data systems are intrinsically slow in some (not all) cases
• All these systems are “monolithic” and it is difficult to deal with individual components
• Execute HPBDC from a classic big data system with a custom communication environment – the approach of Harp for the relatively simple Hadoop environment
• Provide a native Mesos/Yarn/Kubernetes/HDFS high performance execution environment with all the capabilities of Spark, Hadoop and Heron – the goal of Twister2
• Execute with MPI in a classic (Slurm, Lustre) HPC environment
• Add modules to existing frameworks like Scikit-Learn or TensorFlow, either as new capabilities or as higher performance versions of existing modules.
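The Harp, Twister2 and MPI routes above all hinge on efficient collective communication. A toy, single-process simulation of the allreduce semantics involved (illustrative only; real systems use MPI_Allreduce or an equivalent collective, overlapped with computation):

```python
# Simulate a ring-style allreduce over n "ranks", each holding one value.
# After n-1 steps every rank has received every other rank's contribution,
# so all ranks end up with the global sum -- the collective at the heart
# of synchronous parallel machine learning.
def ring_allreduce_sum(values):
    n = len(values)
    acc = values[:]  # per-rank accumulators
    for step in range(1, n):
        for rank in range(n):
            # rank folds in the contribution originating `step` hops
            # upstream on the ring
            acc[rank] += values[(rank - step) % n]
    return acc

print(ring_allreduce_sum([1.0, 2.0, 3.0, 4.0]))  # [10.0, 10.0, 10.0, 10.0]
```

How fast a runtime executes exactly this pattern is much of what separates the HPC approaches above from stock big data frameworks.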
22. Working with Industry?
• Many academic areas today have been turned upside down by the increased role of industry, as there is so much overlap between major University research issues and the problems where the large technology companies are battling it out for commercial leadership.
• Correspondingly we have seen -- especially with top-ranked departments -- increasing numbers and styles of industry-university collaboration. Probably most departments must join this trend and increase their Industry links if they are to thrive.
• Sometimes faculty are 20% University and 80% Industry
• These links can provide the student internship opportunities needed for ABET and traditional in the field.
• However, this is a double-edged sword, as the increased access to internships for our best Ph.D. students is one reason for the decrease in University research contribution.
• We should try to make such internships part of joint University-Industry research and not just an Industry-only activity
• Relationships can include jointly run centers such as NSF I/UCRCs and Industry-oriented University centers such as Intel Parallel Computing Centers
23. GAIMSC Global AI & Modeling Supercomputer Questions
• What do we gain from the concept? e.g. the ability to work with the Big Data community
• What do we lose from the concept? e.g. everything runs as slowly as Spark
• Is GAIMSC useful for the BDEC2 initiative? For NSF? For DoE? For Universities? For Industry? For users?
• Does adding modeling to the concept add value?
• What are the research issues for GAIMSC? e.g. how to program it?
• What can we do with GAIMSC that we couldn’t do with classic Big Data technologies?
• What can we do with GAIMSC that we couldn’t do with classic HPC technologies?
• Are there deep or important issues associated with the “Global” in GAIMSC?
• Is the concept of an auto-tuned Global AI Supercomputer scary?