The document discusses EDF's use of a data lake and data lab to optimize operations and safety at their nuclear power plants. It describes how EDF is building a Hadoop-based data lake called ESPADON to store sensor and operational data from their 59 nuclear plants. They are developing data science algorithms to analyze the data from the whole fleet to improve maintenance and operations. EDF is also creating a data lab team and architecture to develop analytics and quantify the value of these initiatives.
How to go from zero to data lakes in days - ADB202 - New York AWS Summit | Amazon Web Services
AWS provides the most comprehensive, secure, scalable, and cost-effective portfolio of services for building and managing data lakes. Now with AWS Lake Formation, you can build a secure data lake in days. In this session, learn how Lake Formation makes it simple to discover, catalog, clean, and load your data into a new data lake. Discover how you can easily secure access to that data and analyze it with services like Amazon Athena, Amazon Redshift, and Amazon EMR. Hear about Alcon’s data lake journey to the AWS Cloud and the challenges it overcame for a successful and productive data lake implementation.
Simple icons to assist with technical diagrams covering basic physical and software components of Hadoop architectures.
Download the Visio and Omnigraffle stencils, EPS and HiRes PNGs here: http://bit.ly/17mQJ9k
Have you ever wondered what the relative differences are between two of the more popular open source, in-memory data stores and caches? In this session, we will describe those differences and, more importantly, provide live demonstrations of the key capabilities that could have a major impact on your architectural Java application designs.
One of the most important factors in an organization’s success is its ability to extract actionable information from its data. However, the exponential growth of available data has put numerous operational pressures on IT and storage administrators to effectively ingest, transfer, process, store, back up, and archive that data. AWS offers numerous data transfer and storage services and solutions that can scale with your data growth and help meet security and compliance requirements. Attend this session to learn how to use AWS storage services to manage the entire lifecycle of your data, from ingestion to archive.
What is Talend | Talend Tutorial for Beginners | Talend Online Training | Edu... | Edureka!
( Talend Training: https://www.edureka.co/talend-for-big-data )
This Edureka video on What Is Talend gives you complete insight into what Talend actually is, its various products, and how it is used in the industry.
This video covers the following topics:
1. What Is Talend?
2. Evolution Of Talend
3. Talend Products
4. Use Cases
5. Demo
This slide deck helps you understand how to use the WEKA tool for association rule mining. It gives a brief overview of how to prepare a dataset for use in WEKA and how to visualize the results.
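The support/confidence arithmetic behind the association rules WEKA mines (e.g., with its Apriori implementation) can be illustrated in a few lines. The transactions and item names below are invented for illustration; WEKA would read such data from an ARFF file, but the math is the same:

```python
# Toy market-basket data (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}: milk appears in 2 of the 3 baskets containing bread.
print(round(confidence({"bread"}, {"milk"}), 3))  # -> 0.667
```

A miner such as Apriori simply enumerates candidate itemsets and keeps the rules whose support and confidence clear user-chosen thresholds.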
A School ERP System, or enterprise resource planning system, is used to manage and plan the assets and resources of an educational institute.
VISIT: https://www.edujournal.com/school-erp-system/
Cloud Presentation and OpenStack case studies -- Harvard University | Barton George
The presentation walks through the forces affecting IT in higher education today, the value of a cloud brokerage model and case studies of OpenStack-based clouds in higher education. Presented at the Harvard University IT summit.
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight at AWS data stores, flat files, or other third-party data sources and begin visualizing your data in minutes. We also introduce you to SPICE, the Super-fast, Parallel, In-memory Calculation Engine in Amazon QuickSight, which performs advanced calculations and renders visualizations rapidly without requiring any additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides smart visualizations and graphs optimized for your different data types, ensuring the most suitable visualization for your analysis, and how to share these visualization stories using the built-in collaboration tools.
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv | Amazon Web Services
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a fully powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Hadoop Infrastructure @Uber Past, Present and Future | DataWorks Summit
Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop is central to the data infrastructure. We will talk about the journey of Hadoop at Uber and our future plans for scaling to billions of trips. We will cover Uber’s most distinctive use cases, how Hadoop and the ecosystem we built helped us on this journey, how we scaled from 10 to 2,000 nodes, and how we plan to scale to tens of thousands of nodes in the future. We will share our mistakes, learnings, and wins, and describe how we process billions of events per day. We will discuss the unique challenges and real-world use cases involved in co-locating Uber’s service architecture with batch workloads (e.g., data pipelines, machine learning, and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has solved some of its problems in ways not tried before. This presentation will help the audience use Uber as an example and encourage them to enhance the ecosystem, growing the community around these projects and benefiting the whole big data space. The audience is anybody working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes. The talk will help them understand the Hadoop ecosystem and how to use it efficiently, and will introduce some of the technologies the Uber team is building in the big data space.
As organizations pursue Big Data initiatives to capture new opportunities for data-driven insights, data governance has become table stakes, both from the perspective of external regulatory compliance and for business value extraction internally within an enterprise. This session will introduce Apache Atlas, a project that was incubated by Hortonworks along with a group of industry leaders across several verticals, including financial services, healthcare, pharma, oil and gas, retail, and insurance, to help address data governance and metadata needs with an open, extensible platform governed under the aegis of the Apache Software Foundation. Apache Atlas empowers organizations to harvest metadata across the data ecosystem and to govern and curate data lakes by applying consistent data classification with a centralized metadata catalog.
In this talk, we will present the underpinnings of the architecture of Apache Atlas and conclude with a tour of its governance capabilities, showcasing various features for open metadata modeling, data classification, and visualizing cross-component lineage and impact. We will also demo how Apache Atlas delivers a complete view of data movement across several analytic engines, such as Apache Hive, Apache Storm, and Apache Kafka, along with capabilities to effectively classify and discover datasets.
A brief presentation for an internship project at BEL on data visualization using Seaborn and Matplotlib.
Some sensitive information has been redacted.
A data lake is a flat data store that collects data in its original form, without the need to enforce a predefined schema. Instead, new schemas or views are created “on demand”, providing a far more agile and flexible architecture while enabling new types of analytical insights. AWS provides many of the building blocks required to help organizations implement a data lake. In this session, we introduce key concepts for a data lake and present aspects related to its implementation. We discuss critical success factors and pitfalls to avoid, as well as operational aspects such as security, governance, search, indexing, and metadata management. We also provide insight into how AWS enables a data lake architecture. Attendees get practical tips and recommendations to get started with their data lake implementations on AWS.
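The “schema on demand” idea can be sketched in a few lines: records land in the lake as raw text exactly as they arrived, and a view imposes structure only at read time. The field names and records below are invented for illustration:

```python
import json

# Raw records stored exactly as they arrived -- no schema enforced on write.
raw_lake = [
    '{"ts": "2021-01-01", "sensor": "t1", "temp_c": 21.5}',
    '{"ts": "2021-01-02", "sensor": "t1", "humidity": 0.4}',
    '{"ts": "2021-01-03", "sensor": "t2", "temp_c": 19.0}',
]

def temperature_view(lines):
    """A schema applied on read: project only the fields this analysis needs."""
    for line in lines:
        rec = json.loads(line)
        if "temp_c" in rec:  # records that don't fit this view are skipped
            yield (rec["ts"], rec["sensor"], rec["temp_c"])

rows = list(temperature_view(raw_lake))
print(rows)
```

A different analysis can define a different view (say, over the humidity fields) against the same untouched raw data, which is what makes the architecture flexible.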
Testing a Big Data application is more about verifying its data processing than testing individual features. It demands a high level of testing skill because the processing is very fast.
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018 | Amazon Web Services
Gain in-depth knowledge and best practices for migrating commercial data warehouses to Amazon Redshift using AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT). We use an example based on an Oracle data warehouse, and we discuss approaches to migrate it to Amazon Redshift. We also discuss some of the common challenges, limitations, and workarounds, as well as the option of using AWS Snowball to migrate very large data warehouses to Amazon Redshift.
Building the Enterprise Data Lake: A look at architecture | Mark Madsen
The topic is building an Enterprise Data Lake, discussing high-level data and technology architecture. We will describe the architecture of a data warehouse, how a data lake needs to differ, and show a high-level functional and data architecture for a data lake. This webinar will cover:
Why dumping data into Hadoop and letting users get it out doesn't work
The difference between a Hadoop application and a Data Lake
Why new ideas about data architecture are a key element
An Enterprise Data Lake reference architecture to frame what must be built
10 Amazing Things To Do With a Hadoop-Based Data Lake | VMware Tanzu
Greg Chase, Director, Product Marketing, Big Data, presents 10 Amazing Things to do With A Hadoop-based Data Lake at the Strata Conference + Hadoop World 2014 in NYC.
Logistics: transport in commerce | Thomas Malice
Logistics: transport in commerce.
The evolution of transport in international commerce.
A comparison of the different means of international transport.
The use of transport in commercial exchanges.
Case studies: Kiala, TNT Express, DPD, DHL, FNAC
Creative Capital, Information & Communication Technologies, & Economic Growth... | Regional Science Academy
Presentation by Amit Batabyal, Rochester Institute of Technology
Advanced Brainstorm Carrefour (ABC): ‘Smart People in Smart Cities’
Matej Bel University, Banská Bystrica, Slovakia (August, 2016)
How HPC and large-scale data analytics are transforming experimental science | inside-BigData.com
In this deck from DataTech19, Debbie Bard from NERSC presents: Supercomputing and the scientist: How HPC and large-scale data analytics are transforming experimental science.
"Debbie Bard leads the Data Science Engagement Group at NERSC. NERSC is the mission supercomputing center for the US Department of Energy, and supports over 7000 scientists and 700 projects with supercomputing needs. A native of the UK, her career spans research in particle physics, cosmology, and computing on both sides of the Atlantic. She obtained her PhD at Edinburgh University, and worked at Imperial College London as well as the Stanford Linear Accelerator Center (SLAC) in the USA before joining the Data Department at NERSC, where she focuses on data-intensive computing and research, including supercomputing for experimental science and machine learning at scale."
Watch the video: https://wp.me/p3RLHQ-kLV
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...) | Confluent
The Oak Ridge Leadership Computing Facility (OLCF) in the National Center for Computational Sciences (NCCS) division at Oak Ridge National Laboratory (ORNL) houses world-class high-performance computing (HPC) resources and has a history of operating top-ranked supercomputers on the TOP500 list, including the world's current fastest, Summit, an IBM AC922 machine with a peak of 200 petaFLOPS. With the exascale era rapidly approaching, the need for a robust and scalable big data platform for operations data is more important than ever. In the past, when a new HPC resource was added to the facility, pipelines from data sources spanned multiple data sinks, which oftentimes resulted in data silos, slow operational data onboarding, and non-scalable data pipelines for batch processing. Using Apache Kafka as the message bus of the division's new big data platform has allowed for easier decoupling of scalable data pipelines, faster data onboarding, and stream processing, with the goal of continuously improving insight into the HPC resources and their supporting systems. This talk will focus on the NCCS division's transition to Apache Kafka over the past few years to enhance the OLCF's current capabilities and prepare for Frontier, OLCF's future exascale system, including the development and deployment of a full big data platform in a Kubernetes environment from both technical and cultural perspectives. This talk will also cover the mission of the OLCF, the operational data insights related to high-performance computing that the organization strives for, and several use cases that exist in production today.
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political... | Larry Smarr
10.10.11
Presentation by Larry Smarr to the NSF Campus Bridging Workshop
Title: A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political, and Economic
Anaheim, CA
Accelerators at ORNL - Application Readiness, Early Science, and Industry Impact | inside-BigData.com
In this deck from the 2014 HPC User Forum in Seattle, John A. Turner from Oak Ridge National Laboratory presents: Accelerators at ORNL - Application Readiness, Early Science, and Industry Impact.
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch... | Flink Forward
DTW (Dynamic Time Warping) is a well-known method for finding patterns within a time series. It can find a pattern even when the data are distorted, and can be used to detect sales trends, machine-signal defects in industry, patterns in electrocardiograms in medicine, DNA…
Most implementations are very slow, but a very efficient open source implementation (best paper, SIGKDD 2012) exists in C. It can easily be ported to other languages, such as Java, so that it can then be used in Flink.
We present the slight modifications we made so that it can be used with Flink at even greater scale to return the top-k best matches on past or streaming data.
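The core of DTW is a simple dynamic program over all monotone alignments of the two sequences. A minimal, unoptimized O(n·m) sketch (the SIGKDD 2012 implementation adds heavy pruning and early abandoning on top of this recurrence):

```python
def dtw_distance(a, b):
    """Minimum-cost alignment of two sequences, allowing local stretching."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = best cost of aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a[i-1] repeats
                                 D[i][j - 1],      # b[j-1] repeats
                                 D[i - 1][j - 1])  # one-to-one match
    return D[n][m]

# The same shape shifted in time aligns perfectly, unlike Euclidean distance.
print(dtw_distance([0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0]))  # -> 0.0
```

Because warping lets one point match several points in the other series, the shifted copies above get distance 0, which is exactly what makes DTW robust to distortion.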
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech... | Databricks
In this session, you will learn how CERN easily applied end-to-end deep learning and analytics pipelines on Apache Spark at scale for High Energy Physics using BigDL and Analytics Zoo open source software running on Intel Xeon-based distributed clusters.
Technical details and development learnings will be shared using an example of topology classification to improve real-time event selection at the Large Hadron Collider experiments. The classifier has demonstrated very good performance figures for efficiency, while also reducing the false positive rate compared to the existing methods. It could be used as a filter to improve the online event selection infrastructure of the LHC experiments, where one could benefit from a more flexible and inclusive selection strategy while reducing the amount of downstream resources wasted in processing false positives.
This is part of CERN’s research on applying Deep Learning and Analytics using open source and industry standard technologies as an alternative to the existing customized rule based methods. We show how we could quickly build and implement distributed deep learning solutions and data pipelines at scale on Apache Spark using Analytics Zoo and BigDL, which are open source frameworks unifying Analytics and AI on Spark with easy to use APIs and development interfaces seamlessly integrated with Big Data Platforms.
4 TeraGrid Sites Have Focal Points:
- SDSC – The Data Place: large-scale and high-performance data analysis/handling; every cluster node is directly attached to the SAN
- NCSA – The Compute Place: large-scale, large-FLOPS computation
- Argonne – The Viz Place: scalable viz walls
- Caltech – The Applications Place: data and FLOPS for applications, especially some of the GriPhyN apps
Specific machine configurations reflect this.
Blue Waters and Resource Management - Now and in the Futureinside-BigData.com
In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future.
Watch the video of this presentation: http://insidehpc.com/?p=36343
This is a presentation by Prof. Anne Elster at the International Workshop on Open Source Supercomputing held in conjunction with the 2017 ISC High Performance Computing Conference.
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Many organizations currently process various types of data in different formats. Most often this data is free-form; as the number of consumers of this data grows, it is imperative that this free-flowing data adhere to a schema. It helps data consumers know what type of data to expect, and shields them from immediate impact if an upstream source changes its format. A uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and more.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we show how deep learning can be used to detect anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different big data engines such as Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current DL research treats step-motherly. As we see in this demo, LSTM networks can learn very complex system behavior, in this case data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that it normally requires a very large labeled training data set. This is particularly interesting because we can show how unsupervised machine learning can be used in conjunction with deep learning: no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with tenfold confidence. All examples and all code will be made publicly available and open source; only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases that cut across components. The system tests can generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives outnumber actual defects and are generally wasteful.
At Hortonworks, we designed and implemented Mool, an automated log analysis system using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating failing test cases with current and historical error records across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new or reopen past tickets, and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second release of the speaker's HBase book, combined with practical experience from medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, through determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16x, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2x throughput increases for the Capacity scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the single-use-case or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components, and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
A look at the key trends across hardware, cloud, and open source: exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear fleet
1. A Data Lake and a Data Lab to Optimize
Operations and Safety Within a Nuclear Fleet
Hadoop Summit 2016, San José, June 30th
Marie-Luce PICARD, EDF R&D – marie-luce.picard@edf.fr
Jean-Marc RANGOD, EDF-DPNT
Christophe SALPERWYCK, EDF R&D
Special thanks to Raphaël QUERCIA EDF-DTG, Carole MAI and Amandine PIERROT EDF R&D
2. 2
Outline
1. A FEW WORDS ABOUT EDF
2. CONTEXT AND OBJECTIVES
3. A DATA LAKE FOR A NUCLEAR FLEET
4. DATA SCIENCE ALGORITHMS FOR OPTIMIZING OPERATIONS
5. A DATA LAB IN PROGRESS
6. AS A CONCLUSION
Brice Richard - Flickr
KC Tan Photography - Flickr
3. 3
Outline
1. A FEW WORDS ABOUT EDF
2. CONTEXT AND OBJECTIVES
3. A DATA LAKE FOR A NUCLEAR FLEET
4. DATA SCIENCE ALGORITHMS FOR OPTIMIZING OPERATIONS
5. A DATA LAB IN PROGRESS
6. AS A CONCLUSION
Brice Richard - Flickr
KC Tan Photography - Flickr
4. 4
ELECTRICITY GENERATION
623.5 TWH
All electricity-related activities
Generation
Transmission & Distribution
Trading and Sales & Marketing
Energy services
Key figures*
€72.9 billion in sales
38.5 million customers
158,161 employees worldwide
84.7% of generation does not emit CO2
2014 INVESTMENTS
€4.5 BILLION
EDF: A GLOBAL LEADER IN ELECTRICITY
*as of 2015
EDF :
AN EFFICIENT,
RESPONSIBLE
ELECTRICITY COMPANY
AND THE CHAMPION
OF LOW-CARBON
GROWTH
5. WORLD’S LEADING OPERATOR, EXCELLENT
PERFORMANCE IN FRANCE
72.9 GW installed capacity, 54% of the Group’s net generation
capacity
477.7 TWh generated, 77% of the Group’s output
58 reactors operated in France,
15 in the UK
3 EPR under construction:
— 1 in Flamanville (France)
— 2 in Taishan (China)
2 EPR in project phase
OSART safety audit
17 best practices identified by IAEA
France
Best generation performance for six years
UK
World record for safety in the workplace
China
Strengthened cooperation agreement with CNNC
NUCLEAR
EDF 2015 I P.5
8. EDF LAB PARIS-SACLAY
Scientific partnerships with actors of Paris-Saclay
8 research departments
Exceptional buildings
4 outstanding test halls
Unique equipment, innovative communication tools
Diverse areas of expertise
1500 work stations
Plenty of collaborative spaces
9. 9
Main Big Data related challenges for EDF
Power Generation
Process monitoring and condition-based maintenance
from sensors
Power generation forecasting for renewables
Energy management
Load forecasting
Balancing and optimizing generation and consumption
(using smart metering information, including
renewables)
Electrical networks
Smart Grid operations (local)
Condition-based maintenance
Customers and sales
New services to customers using smart-metering data
Smart Homes, Smart Building, Smart Cities management
related to energy
10. 10
Outline
1. A FEW WORDS ABOUT EDF
2. CONTEXT AND OBJECTIVES
3. A DATA LAKE FOR A NUCLEAR FLEET
4. DATA SCIENCE ALGORITHMS FOR OPTIMIZING OPERATIONS
5. A DATA LAB IN PROGRESS
6. AS A CONCLUSION
Brice Richard - Flickr
KC Tan Photography - Flickr
11. 11
Operations and maintenance of the nuclear fleet
The maintenance policy of EDF's generation fleet is optimized to ensure the reliability and safety of equipment and systems while strengthening our competitiveness:
Have better diagnosis, improved performance and availability
Make better use of data and documents, so far stored in data silos
More globally, the IT teams and projects aim to:
Strengthen the performance of operations and maintenance through a global fleet approach
Simplify the Industrial Information System architecture
Improve and develop the way we use our data
Accumulate and archive data over time
… while reducing costs
12. 12
Voluminous and heterogeneous data …. stored in data silos
Source : Wikipedia
One DB per nuclear site, gathering data from sensors. Use of data historians.
Focus on data:
High volume:
data is stored up to 40-60 years (lifetime of the plant)
SCADA data can be sampled every 20 to 40 ms (but mainly a few
seconds)
Around 10,000 sensors per plant
Variety:
Data is heterogeneous
Time series, images, documents
Various data sources
The current systems (historians) do not allow many concurrent accesses, and their SLAs are quite poor
14. 14
Outline
1. A FEW WORDS ABOUT EDF
2. CONTEXT AND OBJECTIVES
3. A DATA LAKE FOR A NUCLEAR FLEET
4. DATA SCIENCE ALGORITHMS FOR OPTIMIZING OPERATIONS
5. A DATA LAB IN PROGRESS
6. AS A CONCLUSION
Brice Richard - Flickr
KC Tan Photography - Flickr
16. 16
Zoom on data
4 generations of plants, but high level of normalization of data and sensors (for
example, use of trigrams for identification of elementary systems)
Two main types of sensors: ANA (analog measurements) and TOR (binary state events)
Time series
Volume
For the POC, 10 plants, 2 years: about 20 billion points
Target (59 plants): 15 TB of data (all plants, whole lifecycle)
Metric (global)  Date  Value  Quality
BU2ABP177MT- 2015-04-30T22:05:00.000Z 156.6 Good/M
BU2ABP177MT- 2015-04-30T22:06:00.000Z 156.4 Good/M
BU2ABP177MT- 2015-04-30T22:07:00.000Z 156.2 Good/M
BU2ABP177MT- 2015-04-30T22:08:00.000Z 156.0 Good
BU2ABP177MT- 2015-04-30T22:09:00.000Z 156.2 Good/M
BU2ABP177MT- 2015-04-30T22:10:00.000Z 156.4 Good/M
BU2ABP177MT- 2015-04-30T22:12:00.000Z 156.7 Good/M
BU2ABP177MT- 2015-04-30T22:14:00.000Z 157.1 Good
BU2ABP177MT- 2015-04-30T22:15:00.000Z 157.3 Good
BU2ABP177MT- 2015-04-30T22:16:00.000Z 157.5 Good
BU2ABP177MT- 2015-04-30T22:19:00.000Z 157.3 Good/M
BU2ABP177MT- 2015-04-30T22:20:00.000Z 157.1 Good/M
BU2ABP177MT- 2015-04-30T22:21:00.000Z 157.3 Good/M
BU2ABP177MT- 2015-04-30T22:22:00.000Z 157.1 Good/M
BU2ABP177MT- 2015-04-30T22:24:00.000Z 156.9 Good/M
BU2ABP177MT- 2015-04-30T22:27:00.000Z 157.1 Good/M
BU2ABP177MT- 2015-04-30T22:28:00.000Z 157.3 Good/M
BU2ABP177MT- 2015-04-30T22:29:00.000Z 157.5 Good/M
BU2ABP177MT- 2015-04-30T22:30:00.000Z 157.7 Good/M
17. 17
Data model
Use of HBASE and PHOENIX
Distributed key/value store
Allows model updates (evolving normalization requirements, new indicators, new plants)
Phoenix for SQL compliance + BI tools
Tables
3 tables: DDT, ANA, TOR
Rowkey: <sensorid, timestamp> (queries mainly consider one or several sensors over a period of time)
Sequential storage; split into HFiles and HRegions according to the plant unit
ANA table:
Key: m (concat(metriqueid, timestamp))
ColumnFamily: 0
Column v -> H_ValeurANA (Phoenix type: Float)
Column q -> H_QualitéANA (Char(10))
Column n -> H_NiveauxANA (Varchar(10))
TOR table:
Key: m (concat(metriqueid, timestamp))
ColumnFamily: 0
Column v -> H_ValeurTOR (Varchar(10))
Column q -> H_QualiteTOR (Char(10))
Column n -> H_NiveauxTOR (Varchar(10))
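The &lt;sensorid, timestamp&gt; rowkey above can be sketched in a few lines of Python; the fixed sensor-id width, the helper name, and the binary layout are illustrative assumptions, not EDF's actual encoding.

```python
import struct
from datetime import datetime, timezone

SENSOR_ID_WIDTH = 16  # assumed fixed width so that keys sort correctly

def make_rowkey(sensor_id: str, ts: datetime) -> bytes:
    """Build an HBase rowkey as <sensorid, timestamp>.

    Padding the sensor id to a fixed width keeps all points of one sensor
    contiguous, and the big-endian epoch-millis suffix keeps them in time
    order -- exactly what range scans over "one sensor, one period" need.
    """
    sid = sensor_id.ljust(SENSOR_ID_WIDTH).encode("ascii")
    millis = int(ts.timestamp() * 1000)
    return sid + struct.pack(">q", millis)  # 8-byte big-endian timestamp

key = make_rowkey("BU2ABP177MT-", datetime(2015, 4, 30, 22, 5, tzinfo=timezone.utc))
```

With this layout, a query for one sensor over one month becomes a single contiguous scan between two such keys.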
18. 18
Validation and performance evaluation
POC validation
Upload of historical data; queries / analyses
Existing functions: viz, reports, services
Data injection: SCADA for the whole fleet,
integration of other sources of data
Results
6 weeks (estimated) needed to upload historical data
from 59 plants
Queries for validating the model :
Use of JMeter to simulate load
With or without insertion workload
~ < 1 second for drawing a curve for a selected month
Integration of an existing GUI for viz (completed within a few days)
Validation of specific calculation within reports
ODBC link for specific e-monitoring application
Integration of various sources of (structured) data into
the data lake
‘Real-time’ insertion of data (micro-batch):
Up to 2M points / s
Very low latency between insertion and availability (< 10s)
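The micro-batch insertion path above can be sketched as a simple batching loop; the batch size, the point layout, and the upsert callback (which would wrap a Phoenix UPSERT in practice) are illustrative assumptions, not EDF's code:

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Point = Tuple[str, int, float]  # (sensor_id, epoch_ms, value) -- assumed layout

def micro_batches(points: Iterable[Point], size: int) -> Iterator[List[Point]]:
    """Group an unbounded point stream into fixed-size micro-batches."""
    batch: List[Point] = []
    for p in points:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the tail so no point waits indefinitely
        yield batch

def ingest(points: Iterable[Point],
           upsert: Callable[[List[Point]], None],
           size: int = 10_000) -> int:
    """Send each micro-batch to the store; returns the number of points written."""
    written = 0
    for batch in micro_batches(points, size):
        upsert(batch)
        written += len(batch)
    return written
```

Batching amortizes the per-request cost, which is how throughputs in the millions of points per second become reachable while keeping insertion-to-availability latency low.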
SELECT
MIN(v), MAX(v),
FIRST_VALUE(v) WITHIN GROUP (ORDER BY ts ASC),
LAST_VALUE(v) WITHIN GROUP (ORDER BY ts ASC),
TO_CHAR(ts, 'dd') as day,
TO_CHAR(ts, 'HH') as hour,
TO_CHAR(ts, 'mm') as minute,
count(*) as cnt
FROM
ORLI_ANA
WHERE
m = ? AND
ts > CURRENT_TIME() - 1 AND -- last 24 h
ts < CURRENT_TIME()
GROUP BY
day, hour, minute
Phoenix query (ANA)
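As a sanity check, the per-minute rollup that this query computes (min, max, first, last, and count per bucket) can be reproduced in plain Python on a small sample; this is an illustrative re-implementation for validation, not part of the EDF pipeline.

```python
from collections import OrderedDict

def minute_rollup(rows):
    """rows: iterable of (epoch_ms, value), assumed time-ordered as in a table scan.

    Returns {minute_bucket_ms: (min, max, first, last, count)} -- the same
    aggregates the Phoenix query groups by day/hour/minute.
    """
    out = OrderedDict()
    for ts, v in rows:
        bucket = ts - ts % 60_000  # truncate the timestamp to the minute
        if bucket not in out:
            out[bucket] = (v, v, v, v, 1)
        else:
            mn, mx, first, _, cnt = out[bucket]
            out[bucket] = (min(mn, v), max(mx, v), first, v, cnt + 1)
    return out
```

Comparing such a reference rollup against the query output is one cheap way to validate the data model after a bulk upload.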
19. 19
Outline
1. A FEW WORDS ABOUT EDF
2. CONTEXT AND OBJECTIVES
3. A DATA LAKE FOR A NUCLEAR FLEET
4. DATA SCIENCE ALGORITHMS FOR OPTIMIZING OPERATIONS
5. A DATA LAB IN PROGRESS
6. AS A CONCLUSION
Brice Richard - Flickr
KC Tan Photography - Flickr
20. 20
Added value of data science algorithms on heterogeneous data:
Operations and maintenance can be better optimized through data analytics run on
data coming from the whole fleet
Active and reactive power are indicators of constraints on alternators: they affect their wear
• ~ 50 plants
• 20 years of data
• 10 min interval data
• Phoenix queries allow selecting plants and time periods
• Compute and show reactive power per day or per hour of the
day
• More detailed analysis
• Fleet level analysis
• Interactive queries
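The per-hour-of-day view of reactive power described above can be sketched as a small aggregation over 10-minute samples; the sample layout and function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

def mean_by_hour_of_day(samples):
    """samples: iterable of (epoch_s, reactive_power) at ~10-min intervals.

    Returns {hour_of_day: mean value}, i.e. a daily profile of the
    constraint indicator, computable per plant or over the whole fleet.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, q in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        sums[hour] += q
        counts[hour] += 1
    return {h: sums[h] / counts[h] for h in sums}
```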
21. 21
Added value of data science algorithms on heterogeneous data:
Operations and maintenance can be better optimized through data analytics run on
data coming from the whole fleet
Monitoring and control of contractual agreements when network frequency
varies (plants have to contribute to the global balance)
• Pattern matching
• Response time for different plants
• Different levels of analysis : by plant, by
generation, global
• Generic approach implemented for any
kind of patterns
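A generic pattern-matching pass of the kind described above can be sketched as a sliding-window distance scan returning the top-K best matches; the Euclidean distance metric and the helper name are illustrative choices, not EDF's implementation.

```python
import heapq
import math

def top_k_matches(series, pattern, k=3):
    """Slide `pattern` over `series` and return the k best (distance, offset)
    pairs, smallest Euclidean distance first.

    The offset of a match can then be compared across plants, e.g. to
    measure each plant's response time to a frequency event.
    """
    m = len(pattern)
    scored = []
    for i in range(len(series) - m + 1):
        window = series[i:i + m]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, pattern)))
        scored.append((d, i))
    return heapq.nsmallest(k, scored)
```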
22. 22
Added value of data science algorithms on heterogeneous data
Prediction of plant cooling according to the quality of the water entering the plants
• Correlations?
• According to the plants
• Use of GAM models
• Integration of two internal sources +
external data
• Better understanding
• // Work in progress //
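A GAM fits the target as a sum of smooth functions of each predictor. As a crude stand-in for illustrating that additive structure (plain per-feature polynomial bases and least squares instead of penalized splines; the actual study would use proper GAM tooling), one might write:

```python
import numpy as np

def fit_additive_model(X, y, degree=3):
    """Fit y ≈ intercept + sum_j f_j(x_j), each f_j a degree-`degree`
    polynomial: a rough, unpenalized approximation of a GAM.
    Returns a predict(X_new) function."""
    n, p = X.shape

    def design(M):
        cols = [np.ones((M.shape[0], 1))]
        for j in range(p):
            for d in range(1, degree + 1):
                cols.append(M[:, j:j + 1] ** d)
        return np.hstack(cols)

    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda X_new: design(X_new) @ coef
```

The additive decomposition is what makes GAM-style models attractive here: the fitted per-predictor curves show how each water-quality variable relates to cooling performance.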
23. 23
Integration of data science and visualization: architecture
Hadoop cluster ↔ REST web service (VM) ↔ browser
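A minimal sketch of the REST layer sitting between the browser and the cluster, using only the standard library and a hypothetical in-memory back end. In the real architecture the service running on the VM would forward queries to Phoenix on the Hadoop cluster; the route name and fake data here are assumptions.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory stand-in for the Phoenix/Hadoop back end.
FAKE_RESULTS = {"ORLI_ANA": [{"day": "01", "hour": "12", "cnt": 6}]}

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /query/ORLI_ANA -> JSON rows for that table
        table = self.path.rsplit("/", 1)[-1]
        body = json.dumps(FAKE_RESULTS.get(table, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port=0):
    """Start the service on 127.0.0.1 (port 0 = ephemeral) in a thread."""
    server = HTTPServer(("127.0.0.1", port), QueryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Serving JSON over HTTP keeps the browser-side visualization decoupled from whatever query engine runs behind the service.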
24. 24
Integration of data science: a global approach
Pre-processing: data quality, sampling, synchronization, …
Selection and queries: threshold, pattern matching, period of time, …
Analysis and data science: reporting, exploratory analysis (distribution …), modelling, …
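The three stages above can be composed as a simple function pipeline. The structure is the point here; the stage names mirror the slide, while the implementations are placeholders invented for illustration.

```python
def run_pipeline(series, stages):
    """Pass a time series through pre-processing, selection and analysis
    stages in order; each stage is a function series -> series/result."""
    result = series
    for stage in stages:
        result = stage(result)
    return result

# Placeholder stages mirroring the slide's three blocks.
def drop_missing(series):          # pre-processing: data quality
    return [(t, v) for t, v in series if v is not None]

def above_threshold(limit):        # selection and queries: threshold
    return lambda series: [(t, v) for t, v in series if v > limit]

def summary(series):               # analysis: basic reporting
    values = [v for _, v in series]
    return {"count": len(values), "mean": sum(values) / len(values)}
```

Keeping each stage a plain function makes it easy to swap, say, threshold selection for pattern matching without touching the rest of the chain.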
25. 25
Outline
1. A FEW WORDS ABOUT EDF
2. CONTEXT AND OBJECTIVES
3. A DATA LAKE FOR A NUCLEAR FLEET
4. DATA SCIENCE ALGORITHMS FOR OPTIMIZING OPERATIONS
5. A DATA LAB IN PROGRESS
6. AS A CONCLUSION
26. 26
A Data Lab in progress: a team, an approach …
… and some questions
Objectives:
Bring value from data analytics
Issues:
Skills and organization (between entities)
Architecture:
Operational Hadoop cluster and loads (use of a multitenant enterprise cluster)
Other loads (data science)
Data preparation within Hadoop + an edge machine for data science (Spark, R, Python)
How to quantify value
Development costs and maintenance
How to industrialize
Source: Xebia
27. 27
Outline
1. A FEW WORDS ABOUT EDF
2. CONTEXT AND OBJECTIVES
3. A DATA LAKE FOR A NUCLEAR FLEET
4. DATA SCIENCE ALGORITHMS FOR OPTIMIZING OPERATIONS
5. A DATA LAB IN PROGRESS
6. AS A CONCLUSION
28. 28
Takeaways
A Data Lake for our nuclear fleet
In progress: industrialization and decommissioning of Historian applications
Significant reduction of licensing costs
A Data Lab under construction
POCs showing the added value of data science algorithms, e.g. predictive maintenance
In the context of fleet renovation for plant life extension (major overhaul program): operations & maintenance, generation costs optimization
Remaining issues: skills, organization, technical architecture, quantifying value
Perspectives and technical issues:
Data lakes and labs for other fleets (thermal plants, hydro, renewables)
Scalable time-series analytics (synchronization, missing data …)
Handling heterogeneous data (textual, images, graphs …)
IoT platform
29. References
A proof of concept with Hadoop: storage and analytics of electrical time-series. Marie-Luce Picard, Bruno Jacquin, Hadoop Summit 2012, California, USA, June 2012: http://www.slideshare.net/Hadoop_Summit/proof-of-concent-with-hadoop
Massive Smart Meter Data Storage and Processing on top of Hadoop. Leeley D. P. dos Santos, Alzennyr G. da Silva, Bruno Jacquin, Marie-Luce Picard, David Worms, Charles Bernard. Workshop Big Data 2012, VLDB Conference (Very Large Data Bases), Istanbul, Turkey, 2012: http://www.cse.buffalo.edu/faculty/tkosar/bigdata2012/program.php
Searching time-series with Hadoop in an electric power company. Alice Bérard, Georges Hébrail, BigMine Workshop, KDD 2013, Chicago, August 2013: http://bigdata-mining.org/
Real-time energy data-analytics with Storm. Rémy Saissy, Marie-Luce Picard, Charles Bernard, Bruno Jacquin, Simon Maby, Benoît Grossin, Hadoop Summit 2014, California, USA, June 2014: http://fr.slideshare.net/Hadoop_Summit/t-525p212picard
Computing Data Quality Indicators on Big Data Stream Using a CEP. Wenlu Yang, Alzennyr Gomes Da Silva, Marie-Luce Picard, IEEE Xplore - IWCIM 2015, Prague, November 2015.
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Network. Guillaume Germaine, Thomas Vial, Hadoop Summit Europe 2016, Dublin: http://www.slideshare.net/HadoopSummit/exploring-titan-and-spark-graphx-for-analyzing-timevarying-electrical-networks
Editor's Notes
Nuclear energy supplies competitive, carbon-free electricity that we generate in the best possible safety conditions.
In 2014, the International Atomic Energy Agency conducted an audit on how nuclear safety is integrated into the organisation and processes of our central departments: the IAEA found no departure from its standards and identified 17 best practices.
→ In France, we achieved our best performance in six years thanks to our management of scheduled shutdowns: the average length of extensions was halved. Wintertime fleet availability topped 90%. Our annual output was up 3% (415.9 TWh).
• The principle of the “Grand Carénage” maintenance programme was approved. The programme involves renovating the French nuclear fleet over a 10-year period in order to extend its operating life beyond 40 years if all conditions are met. The investment is put at €55 billion for the entire fleet.
• The Flamanville EPR worksite is continuing; it is the first nuclear plant to be built in France in 15 years.
→ In the UK, output was good (56.3 TWh) despite the unscheduled shutdown of two plants. EDF Energy established a world record for safety in the workplace (0.98 accidents requiring more than one day of lost time per million hours worked by employees and subcontractors).
• The Hinkley Point C project to build two EPR in Somerset took a major step forward: in October, the European Commission approved the main terms of the agreements concluded with the British government.
→ In China, through partnerships, we are taking good advantage of the expertise we have acquired in the design, construction, operation and maintenance of our nuclear fleet.
• Construction of two 1,750 MW EPR in Taishan (EDF 30% in partnership with CGN) is ongoing.
• We signed an agreement to strengthen cooperation in engineering, operation and maintenance with CNNC, China’s largest state-owned nuclear company.