1. EDF conducted a proof of concept to store and analyze massive time-series data from smart meters using Hadoop.
2. The proof of concept involved storing over 1 billion records per day from 35 million smart meters and running analytics queries.
3. Results showed Hadoop could handle tactical queries with low latency and complex analytical queries within acceptable timeframes. Hadoop provides a low-cost solution for massive time-series storage and analysis.
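To make the storage side concrete, here is a minimal sketch of a time-series layout such a proof of concept might use: readings keyed by meter ID plus reversed timestamp in HBase, written via the third-party happybase client. The host, table name, and key scheme are illustrative assumptions, not EDF's actual design.

```python
import happybase  # third-party Thrift client for HBase

MAX_TS = 2**32  # used to reverse timestamps so newest readings sort first

connection = happybase.Connection('hbase-thrift-host')  # assumed host
table = connection.table('meter_readings')              # assumed table name

def put_reading(meter_id, epoch_s, kwh):
    # Row key: zero-padded meter id + reversed timestamp, so one meter's
    # readings are contiguous and the most recent come first in a scan.
    row_key = '%010d#%010d' % (meter_id, MAX_TS - epoch_s)
    table.put(row_key.encode(), {b'd:kwh': str(kwh).encode()})

def latest_readings(meter_id, n=48):
    # Prefix scan touches only this meter's rows; returns the n newest.
    prefix = ('%010d#' % meter_id).encode()
    return list(table.scan(row_prefix=prefix, limit=n))
```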
Supercharging Smart Meter BIG DATA Analytics with Microsoft Azure Cloud- SRP ... (Mike Rossi)
Explosive growth of Smart Meter (SM) deployments has presented key infrastructure challenges across the utility industry. The huge volumes of smart meter data have led the industry to a tipping point that requires investment in modernizing existing data warehouses. Typical modernization efforts lead to huge capital expenditures for DW appliances and storage. Sizing this new infrastructure is tricky and can lead to underutilized or poorly performing hardware.
The Cloud is the catalyst to solving these Big Data challenges.
Utilizing a Cloud architecture delivers huge benefits by:
Maximizing use of existing architecture
Minimizing new capital expenditures (CapEx)
Lowering overall storage costs
Enabling scale on demand
San Antonio’s electric utility making big data analytics the business of the ... (DataWorks Summit)
Being part of a municipality-owned electric utility offers a unique opportunity to lead in the area of big data analytics. What moves the electric utility of the 7th largest city in the U.S.? The answer is, people. For years, CPS Energy has invested in development of local talent, local technology development, city growth, its employees, and an asset infrastructure that is setting the stage for continued success. At CPS Energy, when such investments are topped by a data infrastructure and applications conducive to creation of business insights, we can justify and prioritize investments. For us, the biggest people opportunities in big data analytics are around operations, customer and employee engagement, and safety. The presenter will provide examples and share how his views have evolved from those of a researcher to global renewable energy consultant to technology innovator and more recently a “harvester of value” from within people, process, and technology assets. Lastly, current and anticipated future states with regards to San Antonio’s electric utility big data enablement platform will be presented...
Speaker
Rolando Vega, Manager of Analytics and Business Insight, CPS Energy
Transforming GE Healthcare with Data Platform Strategy (Databricks)
Data and analytics are foundational to the success of GE Healthcare's digital transformation and market competitiveness. This use case focuses on a heavy platform transformation that GE Healthcare drove in the last year, moving from an on-premises legacy data platform strategy to a cloud-native, fully service-oriented strategy. This was a huge effort for an $18B company, executed in the middle of the pandemic, and it enables GE Healthcare to leapfrog in its enterprise data analytics strategy.
GITEX Big Data Conference 2014 – SAP Presentation (Pedro Pereira)
Big, Fast and Predictive Data: How to Extract Real Business Value – in real time.
90% of the world’s data was created in the last two years. If you can harness it, it will revolutionize the way you do business. Big Data solutions can help extract real business value – in real time.
Big Data & Analytics continues to redefine business. Data has transitioned from an underused asset to the lifeblood of the organisation, and a critical component of business intelligence, insight and strategy.
Big Data Scotland is the largest annual data analytics conference held in Scotland: it is supported by ScotlandIS and The Data Lab and free for delegates to attend. The conference is geared towards senior technologists and business leaders and aims to provide a unique forum for knowledge exchange, discussion and cross-pollination.
The programme will explore the evolution of data analytics; looking at key tools and techniques and how these can be applied to deliver practical insight and value. Presentations will span a wide array of topics from Data Wrangling and Visualisation to AI, Chatbots and Industry 4.0.
Key Topics
• Tools and techniques
• Corporate data culture, business processes, digital transformation
• Business intelligence, trends, decision making
• AI, Real-time Analytics, IoT, Industry 4.0, Robotics
• Security, regulation, privacy, consent, anonymization
• Data visualisation, interpretation and communication
• CRM and Personalisation
Discusses building a trust solution for HealthIT and other regulated enterprises using the Hyperledger blockchain, with HBase as off-blockchain storage for scaling, prototyped on Bluemix.
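A minimal sketch of the off-chain pattern described here: the bulky payload lives in HBase while only a digest would be anchored on the blockchain. The table name and the anchor_on_chain stub are hypothetical; a real system would replace the stub with a Hyperledger chaincode invocation.

```python
import hashlib
import happybase  # third-party HBase client, here playing the off-chain store

connection = happybase.Connection('hbase-host')  # assumed host
records = connection.table('health_records')     # hypothetical table

def anchor_on_chain(record_id, digest):
    # Placeholder: a real system would invoke Hyperledger chaincode here.
    print('anchoring %s -> %s' % (record_id, digest))

def store_record(record_id, payload):
    # Keep the bulky payload off-chain in HBase ...
    records.put(record_id.encode(), {b'd:payload': payload})
    # ... and anchor only its SHA-256 digest on the blockchain for integrity.
    anchor_on_chain(record_id, hashlib.sha256(payload).hexdigest())
```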
SMi Group is bringing to London this December a new masterclass training course entitled Big Data for Utilities: combining and creating value from transactional, geospatial and real-time domain information. Don't miss this must-attend course, in association with Alliander and SAP UK & Ireland.
Big Data Real Time Analytics - A Facebook Case Study (Nati Shalom)
Building Your Own Facebook Real Time Analytics System with Cassandra and GigaSpaces.
Facebook's real-time analytics system is a good reference for those looking to build their own real-time analytics system for big data.
The first part covers the lessons from Facebook's experience and the reasons they chose HBase over Cassandra.
In the second part of the session, we learn how to build our own real-time analytics system, achieve better performance, gain real business insights and business analytics from our big data, and make deployment and scaling significantly simpler using the new versions of Cassandra and GigaSpaces Cloudify.
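To make the Cassandra side tangible, a minimal counter-table sketch with the DataStax Python driver; the keyspace and schema are invented for illustration and are not the system described in the talk.

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('analytics')  # hypothetical keyspace

# Counter table: page views per URL per hour (all non-key columns counters).
session.execute("""
    CREATE TABLE IF NOT EXISTS page_views_by_hour (
        url text, hour timestamp, views counter,
        PRIMARY KEY (url, hour))
""")

def record_view(url, hour):
    # Counters are incremented in place: no read-modify-write round trip.
    session.execute(
        "UPDATE page_views_by_hour SET views = views + 1 "
        "WHERE url = %s AND hour = %s", (url, hour))
```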
What is big data in the context of energy & utilities, and how and where can utilities find value in the data? In this C-level presentation we discussed the three prime areas: grid operations, smart metering, and asset & workforce management. A section on cognitive computing for utilities has been omitted from the presentation due to confidentiality - but I tell you, the perspectives on how IBM Watson will help utilities plan and optimize their operations in the near future are mind-blowing!
See more on http://www.ibmbigdatahub.com/industry/energy-utilities
Check out this white paper from eInfochips which showcases how energy and utility providers can unlock potential service opportunities using our predictive analytics solution across all stages of the business cycle. Major utility players are set to roll out millions of smart meters with the aim of generating actionable insights, even though, by the industry's own admission, any serious effort toward monetization is being offset by a lack of core IT capabilities, especially in big data technology. Capturing proactive intelligence on consumer behavior is the way to go. In this white paper, eInfochips demonstrates how utility players can predict demand response and generation response, and create new revenue models around coincidental peak demands, smart expenditure modeling and other forms of end-user data.
From grid infrastructure analytics to consumer analytics, the true power of data is starting to be realized. Greentech Media Co-Founder and President, Rick Thompson, sets the stage for the day's presentations and panels.
A Big Data Telco Solution by Dr. Laura Wynter (wkwsci-research)
Presented during the WKWSCI Symposium 2014
21 March 2014
Marina Bay Sands Expo and Convention Centre
Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society for Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables affect quality and yield, and which process variables predict equipment failure? Getting to those answers is giving forward-thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
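As a toy illustration of the multivariate point above, this sketch aligns sensor readings with ERP maintenance events in pandas so that models can see both at once; the files and column names are invented for the example.

```python
import pandas as pd

# Hypothetical extracts: time-stamped sensor data plus ERP maintenance events.
sensors = pd.read_csv('sensor_readings.csv', parse_dates=['ts'])
maintenance = pd.read_csv('maintenance_log.csv', parse_dates=['ts'])

# Align each reading with the most recent maintenance event per machine,
# producing one multivariate frame for quality/failure modeling.
merged = pd.merge_asof(
    sensors.sort_values('ts'), maintenance.sort_values('ts'),
    on='ts', by='machine_id', direction='backward')
print(merged.head())
```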
Joerg Bienert, CTO of ParStream, held a presentation on February 25, 2014 about Big Data for Business Users. He talked about several use cases of current ParStream customers and about ParStream's technology itself.
Protecting data privacy in analytics and machine learning ISACA London UK (Ulf Mattsson)
ISACA London Chapter webinar, Feb 16th 2021
Topic: “Protecting Data Privacy in Analytics and Machine Learning”
Abstract:
In this session, we will discuss a range of new emerging technologies for privacy and confidentiality in machine learning and data analytics. We will discuss how to put these technologies to work for databases and other data sources.
When we think about developing AI responsibly, there are many different activities that we need to consider.
This session also discusses international standards and emerging privacy-enhanced computation techniques, secure multiparty computation, zero trust, cloud and trusted execution environments. We will discuss the “why, what, and how” of techniques for privacy-preserving computing.
We will review how different industries are taking advantage of these privacy-preserving techniques. A retail company used secure multi-party computation to respect user privacy and specific regulations while allowing the retailer to gain insights and protecting the organization's IP. Secure data-sharing is used by a healthcare organization to protect the privacy of individuals; it also stores and searches encrypted medical data in the cloud.
We will also review the benefits of secure data-sharing for financial institutions, including a large bank that wanted to broaden access to its data lake without compromising data privacy while preserving the data's analytical quality for machine learning purposes.
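To ground the secure multi-party computation idea, here is a minimal additive secret-sharing sketch: each party holds a random-looking share, yet shares can be summed and recombined, which is the building block behind computing joint aggregates without revealing inputs. It is an illustrative toy, not any of the deployments described above.

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is modulo a large prime

def share(secret, n_parties):
    # n-1 random shares; the last share makes the total equal the secret.
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two parties add their secrets by adding shares component-wise:
a, b = share(120, 3), share(42, 3)
summed = [(x + y) % PRIME for x, y in zip(a, b)]
assert reconstruct(summed) == 162
```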
Extracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general purpose solutions that work out of the box don’t exist yet, and Hadoop is no exception. And most companies lack Big Data specialists. The key to unlocking real value lies with thinking smart and hard about the business requirements for a Big Data solution. There is a long list of crucial questions to think about. Is Hadoop really the best solution for all Big Data needs? Should companies run a Hadoop cluster on expensive enterprise-grade storage, or use cheap commodity servers? Should the chosen infrastructure be bare metal or virtualized? The picture becomes even more confusing at the analysis and visualization layer. The answer to Big Data ROI lies somewhere between the herd and nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes in choosing hardware and software. This talk will illustrate how Deutsche Telekom follows this segmentation approach to make sure every individual use case drives architecture design and the selection of technologies and vendors.
Enerji Sektöründe Endüstriyel IoT Uygulamaları - Şahin Çağlayan (Reengen)ideaport
Şahin Çağlayan, co-founder and head of R&D of the Reengen Energy IoT Platform, described cloud-based optimization processes for commercial buildings and the energy grid that bring together Internet of Things and big data analytics capabilities.
-
23 March 2016
meet@ideaport | IoTxTR#21 'Industrial IoT Applications in the Energy Sector' Seminar
Universities as “Smart Cities” in a Globally Connected World - How Will They ... (Larry Smarr)
09.08.20
Invited Talk
Monash University ITS Strategic Planning Session
RE-INVENT to RE-POSITION – TRANSFORMED BY ICT
Title: Universities as “Smart Cities” in a Globally Connected World - How Will They be Transformed?
Melbourne, Australia
My presentation at the Smart Energy Summit held in Singapore, March 2019. My talk focused on how to harness grid digitization capabilities to improve distribution network reliability and integrate distributed renewable resources effectively.
In partnership with the INFOPOLE Cluster TIC, the TWEED Cluster had the pleasure of inviting you to the third workshop in the "Digital Energy Business & Technology Club" series, whose theme was Artificial Intelligence in energy: trends and opportunities. Discover the presentations from the many speakers: DC Brain, Energis, Ingestic, N-Side, Opinum, Thelis-Réseau IA and Yazzoom!
"Presented at ICT4S 2013, the First International Conference on Information and Communication Technologies for Sustainability, held in Zurich, February 2013, http://www.ict4s.org".
What is a Smart Grid?
The Smart Grid Enables the ElectriNet(SM)
Local Energy Networks
Electric Transportation
Low-Carbon Central Generation
What Should Be the Attributes of the Smart Grid?
Why Do We Need a Smart Grid?
Is the Smart Grid a “Green Grid”?
Alternative Views of a Smart Grid
Fog Computing – Enhancing the Maximum Energy Consumption of Data Servers (dbpublications)
Fog Computing and IoT systems make use of end-user premises devices as local servers. Here, we identify the scenarios in which running applications from NDCs is more energy-efficient than running the same applications from an MDC. Through a survey and analysis of various energy consumption factors, such as different flow-variants and time-variants with respect to the network equipment, we found two energy consumption use cases and their respective results. Parameters such as current load, Pmax, Cmax, and incremental energy were evaluated with respect to system structure and various data-related parameters, leading to the conclusion that the NDC uses relatively less energy than the MDC. The study reveals that NDCs, as part of the Fog, can complement MDCs in serving these applications, especially in scenarios where IoT-based applications are used, end users are the source data providers, and server utilization can be maximized.
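The comparison hinges on per-device power parameters such as Pmax and incremental energy. A common simplification, and an assumption here since the paper's exact model is not given, is the linear utilization model below.

```python
def server_power_watts(utilization, p_idle=100.0, p_max=250.0):
    """Linear power model: an idle floor plus a load-proportional increment.

    The idle/max wattages are illustrative defaults, not values from the paper.
    """
    assert 0.0 <= utilization <= 1.0
    return p_idle + (p_max - p_idle) * utilization

# Energy for a 2-hour workload at 60% utilization, in watt-hours:
print(server_power_watts(0.6) * 2)  # 380.0
```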
The definition of the "Smart Grid" is something that is taking shape. Utility professionals concur on some aspects and ideas of what the smart grid should be, but there are still grey areas that, however, promise to become clearer soon.
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ... (Edwin Poot)
The Energy Industry is in transition due to the exponential growth of data being generated by the ever-increasing number of connected devices which comprise the Smart Grid. Learn how Energyworx uses GCP to collect and ingest this IoT data with ease and is helping its customers uncover hidden value from this data, allowing them to create new business models and concepts.
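For flavor, a minimal sketch of writing meter readings into Cloud Bigtable with the official Python client; the project, instance, table, and row-key scheme are assumptions for illustration, not Energyworx's actual design.

```python
import datetime
from google.cloud import bigtable

client = bigtable.Client(project='my-project')       # assumed project
table = client.instance('energy').table('readings')  # assumed names

def write_reading(meter_id, ts, kwh):
    # Row key groups one meter's readings together, sorted by time.
    key = '%s#%s' % (meter_id, ts.strftime('%Y%m%d%H%M'))
    row = table.direct_row(key.encode())
    row.set_cell('m', b'kwh', str(kwh).encode(), timestamp=ts)
    row.commit()

write_reading('meter-42', datetime.datetime.now(datetime.timezone.utc), 1.25)
```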
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
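In the spirit of the workshop's scikit-learn samples, a minimal supervised-learning example of the train/evaluate loop (not one of the CDSW labs themselves):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a toy dataset, hold out a test split, then train and evaluate.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```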
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables mapped to HBase.
Apache Phoenix tables are also a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
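For example, a microservice or script can query such a Phoenix table through the Phoenix Query Server with the phoenixdb Python client; the endpoint and crime-table schema below are assumptions patterned on the Philadelphia use case, not the article's exact code.

```python
import phoenixdb

# Phoenix Query Server endpoint (assumed host and port).
conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=True)
cursor = conn.cursor()

# Hypothetical table written by the NiFi flow.
cursor.execute(
    "SELECT dc_dist, text_general_code, dispatch_date_time "
    "FROM PHILLY_CRIME "
    "WHERE dispatch_date_time > CURRENT_DATE() - 1 LIMIT 10")
for row in cursor.fetchall():
    print(row)
```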
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation, there's also the fact that HBase is still an evolving product, with different release versions in use currently, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges they have encountered scaling to support the world catalog, and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
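The heart of the dirlist design is a row key that encodes depth plus path, so that one depth's entries sort together and a directory's children can be enumerated with a single prefix scan. A plain-Python sketch of that key scheme (an illustration of the idea, not the example's exact encoding):

```python
def dir_row_key(path):
    """Encode depth + path, e.g. '/logs/2012' -> '002/logs/2012'.

    Lexicographic sorting then groups all entries at one depth, and a
    prefix scan over '002/logs/' enumerates the children of '/logs'.
    """
    depth = 0 if path == '/' else path.count('/')
    return '%03d%s' % (depth, path)

print(dir_row_key('/'))           # 000/
print(dir_row_key('/logs'))       # 001/logs
print(dir_row_key('/logs/2012'))  # 002/logs/2012
```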
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data and ensure read amplification stays low. Data organization for efficient writing involves factoring in the nature of the input data - whether it is append-only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions, and act as the source-of-truth for all the analytical use-cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout, and annotates each incoming change with the location in HDFS where this data should be written. This component is called Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated, leading to duplication of data and breaking data correctness and user queries. This component is key to scaling our jobs, where we now handle more than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, which is critical in allowing us to scale our jobs to the more than 500 billion writes a day our ingestion systems now handle. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps to scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
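A stripped-down sketch of what a global index lookup against HBase could look like, using the third-party happybase client; the table layout is an assumption for illustration, not Uber's production schema.

```python
import happybase

connection = happybase.Connection('hbase-host')  # assumed host
index = connection.table('record_index')         # hypothetical index table

def locate(record_key):
    """Return the HDFS file holding a record, or None for a fresh insert."""
    cells = index.row(record_key.encode(), columns=[b'loc:file'])
    value = cells.get(b'loc:file')
    return value.decode() if value else None

def remember(record_key, hdfs_file):
    # Bookkeeping write: annotate where this record now lives in HDFS.
    index.put(record_key.encode(), {b'loc:file': hdfs_file.encode()})
```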
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
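Transactions surface in Phoenix as table properties. A sketch using the phoenixdb client, where the TRANSACTIONAL and TRANSACTION_PROVIDER properties follow Phoenix's documented syntax (and assume transactions are enabled server-side), while the schema itself is invented:

```python
import phoenixdb

conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=False)
cur = conn.cursor()

# Declare the table transactional with Omid as the provider.
cur.execute(
    "CREATE TABLE IF NOT EXISTS ad_events ("
    " id BIGINT NOT NULL PRIMARY KEY, campaign VARCHAR, clicks BIGINT)"
    " TRANSACTIONAL=true, TRANSACTION_PROVIDER='OMID'")

# Both upserts commit atomically, or not at all.
cur.execute("UPSERT INTO ad_events VALUES (1, 'spring_sale', 10)")
cur.execute("UPSERT INTO ad_events VALUES (2, 'spring_sale', 7)")
conn.commit()
```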
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
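Ranger exposes policy management through a public REST API, which is how tenant-level automation like this can create policies programmatically. The endpoint path is Ranger's public v2 API; the service, resource, and role names are invented for illustration.

```python
import requests

# Hypothetical Hive policy limiting one tenant's table to its analyst role.
policy = {
    "service": "hive_tenant_a",  # assumed Ranger service name
    "name": "claims_readonly",
    "resources": {
        "database": {"values": ["claims"]},
        "table": {"values": ["member_events"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [{
        "accesses": [{"type": "select", "isAllowed": True}],
        "roles": ["tenant_a_analyst"],
    }],
}

resp = requests.post(
    "https://ranger-host:6182/service/public/v2/api/policy",
    json=policy, auth=("admin", "password"))
resp.raise_for_status()
```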
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
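The "SQL-on-anything" part is the ability to join across catalogs in one query. A minimal example with the presto-python-client, where the coordinator address and catalog/schema/table names are assumptions:

```python
import prestodb

conn = prestodb.dbapi.connect(
    host='presto-coordinator', port=8080, user='analyst',
    catalog='hive', schema='default')  # assumed deployment details
cur = conn.cursor()

# One Presto query can join tables served by different connectors,
# e.g. a Hive fact table against a MySQL dimension table.
cur.execute("""
    SELECT r.name, count(*) AS orders
    FROM hive.sales.orders o
    JOIN mysql.crm.regions r ON o.region_id = r.id
    GROUP BY r.name
""")
print(cur.fetchall())
```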
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MLflow Tracking, Projects and Models components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
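Those "few lines of code" look roughly like this with the MLflow tracking APIs (the model and metric are placeholders, not the demo itself):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(C=0.1, max_iter=200).fit(X, y)
    mlflow.log_param("C", 0.1)                           # hyperparameter
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")             # deployable package
```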
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support Data Analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and we present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
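As a taste of the Flink side of such a pipeline, a tiny PyFlink sketch that filters and keys customer events; the event shape is invented, and the production platform described in the talk is of course far richer than this.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical customer-interaction events: (account_id, event_type).
events = env.from_collection([
    ("acct-1", "outage_detected"),
    ("acct-1", "ticket_opened"),
    ("acct-2", "modem_reboot"),
])

# Keep only events that should trigger a communication, keyed by account
# so that downstream state and correlation are per customer.
(events
    .filter(lambda e: e[1] in ("outage_detected", "ticket_opened"))
    .key_by(lambda e: e[0])
    .print())

env.execute("event_triggered_comms_sketch")
```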
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail and the implications for the broader Consumer Goods industry, and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
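A minimal version of the object-detection building block mentioned above, using torchvision's pretrained COCO detector on a single placeholder frame; the threshold and usage are illustrative, not the speakers' stack.

```python
import torch
import torchvision

# Pretrained COCO detector as a stand-in for a shelf/stock model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

frame = torch.rand(3, 480, 640)  # placeholder for a real camera frame
with torch.no_grad():
    pred = model([frame])[0]     # boxes, labels, scores for one image

# Keep confident detections only, e.g. to count items visible on a shelf.
keep = pred["scores"] > 0.8
print(pred["boxes"][keep], pred["labels"][keep])
```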
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
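The core idea, partitioning reads that share sequence content, can be sketched in a few lines of PySpark. This naive shared-k-mer grouping is only an illustration of the principle, not the actual SpaRC algorithm.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmer_binning_sketch").getOrCreate()
sc = spark.sparkContext

K = 21  # a typical k-mer length; an assumption, not SpaRC's setting

reads = sc.parallelize([
    ("r1", "ACGTACGTACGTACGTACGTACGT"),
    ("r2", "CGTACGTACGTACGTACGTACGTA"),
])

def kmers(read_id, seq):
    for i in range(len(seq) - K + 1):
        yield (seq[i:i + K], read_id)

# Reads sharing a k-mer fall into the same group: a crude clustering signal.
clusters = reads.flatMap(lambda r: kmers(*r)).groupByKey().mapValues(list)
print(clusters.take(5))
```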
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security as an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Knowledge engineering: from people to machines and back
Proof of Concept for Hadoop: storage and analytics of electrical time-series
1. A proof of concept with Hadoop: storage and analytics of electrical time-series
June 13th 2012
Bruno JACQUIN, Marie-Luce PICARD, Leeley DAIO-PIRES DOS SANTOS, Alzennyr GOMES DA SILVA, David WORMS, Charles BERNARD
2. Outline
1. A very brief presentation of the EDF Group
2. Smart metering data
3. Massive data management for utilities?
4. A Proof of Concept using Hadoop
5. Conclusion and Perspectives
4. EDF Group profile
• A leading player in the energy market, active in all areas of electricity from generation to trading and network management.
• Balance between regulated and deregulated activities.
• Expertise in engineering and operating generation plants and networks.
• Expertise in the design and promotion of energy eco-efficiency solutions.
• Leader in the French and UK electricity markets, solid positions in Italy and numerous other European countries; industrial operations in Asia and the United States.
Key figures (consolidated data at 12.31.2010): 37 million customers worldwide; 630.4 TWh electricity generation worldwide; 108.9 g of CO2 per kWh generated (CO2 emissions from EDF Group electricity and heat generation); 158,842 employees worldwide; €65.2 billion in sales.
6. Smart-grid projects everywhere in the world
[World map of smart-grid projects. Key: red = electricity, green = gas, blue = water; triangle = trial or pilot, circle = project]
7. The EDF Group: a bright outlook for smart grids
Benefits around the smart meter:
• Lower consumption peaks mean less dependence on high-carbon generation
• Clearer information to raise awareness of energy-saving strategies
• Decarbonization of the energy mix through a smoother integration of renewable energies into networks
• Billing based on actual consumption
• Reduction in network losses to boost the competitiveness of the system
• Precision in targeted investments for the maintenance and modernization of networks
• More efficient repairs to networks after extreme weather events
• Promoting the development of electric transportation that emits fewer greenhouse gases
• New energy uses (e.g. electric mobility, storage, etc.)
8. Smart grids: what? And what for?
• Environmental, economic, social and policy drivers are leading to a deep change in the energy sector:
  • Climate change, environmental concerns
  • Increased pressure for operational and financial efficiency
  • Increasing awareness of consumers, role of citizens
  • Technological pressure (IT, smart devices)
Source – Wikipedia:
A smart grid delivers electricity from suppliers to consumers using digital technology with two-way communications to control appliances at consumers' homes, to save energy, reduce cost and increase reliability and transparency. It overlays the electrical grid with an information and net metering system, and includes smart meters. Such a modernized electricity network is being promoted by many governments as a way of addressing energy independence, global warming and emergency resilience issues.
10. What does a load curve look like? (Big Data, or “the data deluge”)
11. What does a load curve look like? (2)
12. What does a load curve look like? (3)
Individual load curves:
• Left: same customer, two different days
• Top: same day, two different customers
14. Massive data management in the energy domain: myth or reality?
• Challenges:
  • More complexity in the electric power system (demand response, distributed generation …)
  • Faster evolution of customer indoor equipment (smart meters and devices, Internet of Things …)
  → Core business will involve more IT and data management
• The R&D SIGMA project deals with scalability and Big Data:
  • Skills in Big Data techniques
  • Prototyping on business cases
  • With internal (IT), academic or industrial partners
15. Massive data management in the energy domain: myth or reality?
• The SIGMA project studies and experiments with appropriate methods and techniques:
  • Storage technologies for massive data sets, especially time-series
  • Data processing: Complex Event Processing, real-time analytical processing
  • Large-scale data mining: massively parallel processing, distributed data mining
• Use cases: smart grids, CRM and customer insight, generation optimization (consumption and production forecasting, power plant maintenance)
17. Storing massive time series
• Objective: proof of concept for running a large number of queries (variable levels of complexity, with variable scopes, frequencies and acceptable latencies) on a huge number of load curves
• Data: individual load curves, weather data, contractual information, network data
• 1 measurement every 10 minutes for 35 million customers over a year
• Annual volume of data: 1,800 billion records; 120 TB of uncompressed data
18. Storing massive time series: objectives
• Build an “operational data warehouse” able to:
  • Serve a large volume of data
  • Ingest newly arriving data (pre-processing, synchronization and gap filling)
  • Handle concurrent, simultaneous queries:
    • Tactical queries: selecting a curve and comparing it with a mean curve (see the sketch after this list)
    • Analytical queries: aggregated curves
    • Ad-hoc queries
    • “Recoflux” (simplified), a business application presented later
  • Provide extraction capabilities
20. Using relational technologies for storing massive time series
• Relational approaches, Very Large Databases
• Work carried out with partners: Teradata, Oracle, IBM, EMC², HP
• Appliances or software-only offers?
• Shared-nothing or shared-everything? Column-based, row-based or hybrid storage?
• Separation between an operational use (ODS) and an analytical use (DWH)?
21. Using Hadoop for storing massive time series
• Native distributed file system (HDFS)
• Distributed processing using the Map/Reduce paradigm
• Large dotcom usage but very limited industrial deployment; maturity is yet to come, despite major vendors arriving with offers including integration, appliances and support
• Internal POC concluded in April 2012
22. The data model
Compressed data volume on HDFS: → 10 TB (×3)
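For scale, assuming the “×3” denotes the HDFS replication factor: compressing 120 TB of raw data down to about 10 TB is roughly a 12:1 ratio, plausible for columnar RCFile storage of highly regular time series, and with three-way replication the physical footprint is about 30 TB, which fits comfortably in the 132 TB cluster described below.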
24. Design
• Hive at the center of our DW
• HBase at the forefront of data access
25. Design
• Hive at the center of our DW:
  • Allows ad-hoc and complex analytical queries
  • Customer tables, stored as RCFile, are replicated on all data nodes (19)
  • Consumption measurements are partitioned by day and by customer profile criteria
  • Daily volume: 25 GB; average block size: 10 MB
• HBase at the forefront of data access (see the sketch after this list):
  • Allows low-latency queries
  • Recent metering data is stored in “in-memory” tables
  • Stores a subset of measurements and aggregates in tables with Bloom filters enabled
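The deck does not show the HBase schema, so the following is only a minimal sketch of how such an HBase table could be exposed to Hive through the standard HBase storage handler; the table name recent_cdc, the column family m and the per-slot qualifiers are assumptions:

CREATE EXTERNAL TABLE hbase_recent_cdc (
  row_key STRING,          -- e.g. customer id + day
  p1 INT, p2 INT, p3 INT   -- one column per 10-minute slot (up to p144)
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,m:p1,m:p2,m:p3')
TBLPROPERTIES ('hbase.table.name' = 'recent_cdc');

Note that in-memory caching and Bloom filters are properties of the underlying HBase table itself, set when it is created in HBase, not of the Hive mapping.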
26. Hardware configuration: the cluster
• 20 nodes in 2 racks:
  • 7 × 1U nodes with 4 × 1 TB disks
  • 13 × 2U nodes with 8 × 1 TB disks
• Total: 132 TB; 336 cores (AMD)
• Hadoop distribution: Cloudera CDH3u3 (open source)
28. Time series representation models: options
§ TUPLE
CREATE TABLE cdc_tuple ( id_cdc INT, date_releve TINYINT, p INT )
PARTITIONED BY (day STRING, optarif STRING, psousc TINYINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
§ ARRAY
CREATE TABLE cdc_array ( id_cdc INT, values array< array< int > > ) …
§ COLUMN
CREATE TABLE cdc_144_cols ( id_cdc INT, p1 INT, p2 INT, …, p144 INT ) …
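The slide elides the storage clauses of the last two variants; purely for illustration, a plausible completion of the ARRAY variant, mirroring cdc_tuple, could read:

CREATE TABLE cdc_array (
  id_cdc INT,
  values ARRAY<ARRAY<INT>>   -- e.g. 7 daily arrays of 144 ten-minute values
)
PARTITIONED BY (day STRING, optarif STRING, psousc TINYINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;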
29. Time series representation models: options
Getting a daily individual load curve:
select * from cdc_tuple where day='2008-01-01' and id_cdc = 136630;
30. Time series representation models: impact
Computing a global aggregated load curve for 1 day:
Representation model | Daily volume           | Query execution time
Tuple                | 10.1 GB (×3 replicas)  | 2 min 22 sec
Column               |  8.8 GB (×3 replicas)  | 1 min 17 sec
Array                | 16 GB (×3 replicas)    | 1 min 18 sec
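The aggregation query itself is not shown on the slide; over the tuple representation it could plausibly look like the following (the date is illustrative):

-- global aggregated load curve for one day: one summed value
-- per 10-minute slot, across all customers
SELECT date_releve, SUM(p) AS total_p
FROM cdc_tuple
WHERE day = '2008-01-01'
GROUP BY date_releve;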
31. Results - HBase
• Tactical queries are successfully handled by HBase, offering low latencies under a high concurrent load.
Query: curve selection.
Representation model              | Execution period | Nb concurrent queries | Queries/sec | Query execution time (seconds)
Columns (7 × 144)                 | 1 minute         | 100                   | 470         | 0.21
Array (7 × 1 array of 144 values) | 1 minute         | 100                   | 495         | 0.20
Columns (7 × 144)                 | 5 minutes        | 500                   | 524         | 0.19
Array (7 × 1 array of 144 values) | 5 minutes        | 500                   | 430         | 0.18
32. Results – Hive (1)
Results for different queries, tuple representation (planned queries with adequate partitioning, and ad-hoc queries):
Query                                                                            | Execution time
Aggregation for France (sum), 10-minute intervals                                | 1 min, 56 sec
Load curve aggregated by contractual information                                 | 2 min, 21 sec
Analysing consumption trends according to the customers' building characteristics | 1 hour, 18 sec
TOP N customers, candidates for a power level update                             | 1 hour, 7 min, 35 sec
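As an illustration of the ad-hoc class, a “TOP N candidates for a power level update” query could plausibly be phrased as follows; the ranking criterion, peak load relative to the subscribed power level psousc, is an assumption:

-- top 10 customers whose observed peak load is largest
-- relative to their subscribed power level
SELECT id_cdc, psousc, MAX(p) AS peak_p, MAX(p) / psousc AS load_ratio
FROM cdc_tuple
GROUP BY id_cdc, psousc
ORDER BY load_ratio DESC
LIMIT 10;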
33. Results – Hive (2)
• Recoflux scenarios (all times in minutes):
Scenario | Sequential, 1 day | Sequential, 1 week | Parallel, 1 day | Parallel, 1 week
521      | 1.44              | 10.10              | 1.56            | 3.00
522      | 27.87             | 195.09             | 28.50           | 31.01
523      | 7.98              | 23.94              | –               | –
524      | 10.71             | 74.99              | 15.97           | 19.58
525      | 6.10              | 42.70              | 7.45            | 8.39
526      | 0.86              | 6.08               | 0.92            | 2.43
(Row 523 shows only two values on the original slide; their column assignment is unclear.)
• Recoflux is a very important business application (power consumption aggregations are computed according to different criteria; updates and temporal data): the results are quite acceptable.
34. Using NoSQL technologies for storing massive time series: results
Integration of Hadoop with Tableau Software: visualization of 700k feeders.
35. Alternative approach for storing massive time-series: conclusions
• The minuses:
  • Not yet mature; little feedback available from industry
  • Lack of skills in Europe (configuration and tuning have a strong impact and require sharp skills)
  • Major vendors' offerings: still young but actively emerging
• The pluses:
  • Low cost
  • Ability to recycle existing commodity hardware
  • One of the few solutions that allows coupling structured and unstructured data
  • Flexibility, despite being a complex system to deploy and manage; fault-tolerant and scalable
36. Alternative approach for storing massive time-series: conclusions
• Perspectives:
  • Partners offering industrial support
  • Hardware configuration
  • Usage of statistical libraries
  • Connectivity with the relational world
• Usages:
  • ETL
  • Intelligent and reliable archival solution
  • High-throughput data presentation (publication)
37. Conclusions and perspectives
• Hadoop perspectives:
  • Non-traditional usage of Hadoop, with a structured schema
  • Will become a component of the company's information system for non-critical usages
• Any suggestions?
  • Storage model for time-series?
  • Usages?
• Contacts:
  • marie-luce.picard@edf.fr
  • bruno.jacquin@edf.fr