Practice of large Hadoop cluster in China Mobile - DataWorks Summit
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. Within China Mobile, branch companies in each province build distributed big data clusters for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster of more than 1,600 nodes, on which we collect data from dozens of those distributed clusters and analyze it for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and our experience constructing and tuning this large-scale cluster. Key points are as follows:
1. About Ambari: we extend Ambari with features such as HDFS Federation support and Ambari HA, and improve its performance so that it can manage up to 1,600 nodes.
2. About HDFS: we build a large HDFS cluster holding up to 60 PB of data, using federation, ViewFS, and FairCallQueue (a minimal ViewFS sketch follows this list). We will also share our best practices for cluster operation and management.
3. About Flume: we use a modified Flume to collect as much as 200 TB of data per day.
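For readers who have not used ViewFS, the sketch below shows the kind of client-side mount table a federated namespace relies on. It is a minimal illustration, not our production configuration; the NameNode addresses nn1/nn2 are hypothetical, while the fs.viewfs.mounttable property names are standard Hadoop configuration.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsMountSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Each federated NameNode serves one slice of the unified namespace.
            conf.set("fs.viewfs.mounttable.cluster.link./user", "hdfs://nn1:8020/user");
            conf.set("fs.viewfs.mounttable.cluster.link./logs", "hdfs://nn2:8020/logs");
            // Clients address one logical namespace, viewfs://cluster/.
            FileSystem fs = FileSystem.get(URI.create("viewfs://cluster/"), conf);
            System.out.println(fs.makeQualified(new Path("/logs/2017-06-01")));
        }
    }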
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes - DataWorks Summit
The Hadoop Distributed File System (HDFS) has evolved from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure that holds all of an organization's data. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling HDFS storage management: the centralized scheme within the NameNode has become the main bottleneck limiting the total number of files stored. Although a typical large HDFS cluster can store several hundred petabytes of data, it handles large numbers of small files inefficiently under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. Storage management is promoted to a distributed scheme, and a new concept, the storage container, is introduced for storing objects. HDFS blocks are stored and managed as objects in storage containers instead of being tracked only by the NameNode. Storage containers are replicated across DataNodes using a newly developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Rich placement constraints: Who said YARN cannot schedule services? - DataWorks Summit
The rise in popularity of machine learning, streaming, and latency-sensitive online applications in shared production clusters has raised new challenges for cluster schedulers. To optimize their performance and resilience, these applications require precise control of their placements by means of complex constraints. Examples of such scenarios are the following:
• Deep learning applications need to run on GPU machines with specific GPU models and driver/kernel versions.
• Hive or Spark applications benefit from being collocated on the same rack to reduce network cost and thus speed up their execution. At the same time, it is desirable to limit the number of allocations per machine to minimize resource interference.
• Low-latency services such as HBase need to be allocated across failure domains to improve their availability.
• A DNS service might need to run on machines with public IP address.
In this talk we present the brand new addition of expressive placement constraints in YARN. We show how applications can leverage such constraints to achieve complex placements, such as collocating their allocations on the same node/rack (affinity), spreading their allocations across nodes/racks (anti-affinity), or allowing up to a specific number of allocations per node group (cardinality) to strike a balance between the two. We describe real use cases from production clusters and show the benefits of placement constraints on large clusters using popular applications in both on-prem and cloud settings.
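To make the three constraint types concrete, here is a minimal sketch using the PlacementConstraints API introduced with this work (Hadoop 3.1); the allocation tags are invented for the example:

    import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.*;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

    public class ConstraintSketch {
        public static void main(String[] args) {
            // Anti-affinity: never co-locate two "hbase" containers on one node.
            PlacementConstraint spread = build(targetNotIn(NODE, allocationTag("hbase")));
            // Affinity: place containers on racks that already host "spark" containers.
            PlacementConstraint together = build(targetIn(RACK, allocationTag("spark")));
            // Cardinality: allow at most 3 "worker" containers per node.
            PlacementConstraint capped = build(cardinality(NODE, 0, 3, "worker"));
            System.out.println(spread + "\n" + together + "\n" + capped);
        }
    }

Applications attach such constraints to their scheduling requests when talking to the ResourceManager.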
Speakers
Konstantinos Karanasos, Senior Scientist, Microsoft
Wangda Tan, Staff Software Engineer, Hortonworks
Hive on Spark is blazing fast, or is it? - Hortonworks
This presentation was given at the Strata + Hadoop World, 2015 in San Jose.
Apache Hive is the most popular and most widely used SQL solution for Hadoop. To keep pace with Hadoop’s increasingly vital role in the Enterprise, Hive has transformed from a batch-only, high-latency system into a modern SQL engine capable of both batch and interactive queries over large datasets. Hive’s momentum is accelerating: With Spark integration and a shift to in-memory processing on the horizon, Hive continues to expand the boundaries of Big Data.
In this talk the speakers examined Hive performance: past, present, and future. In particular, they looked at Hive's origins as a petabyte-scale SQL engine.
Through numbers and graphs, they showed how Hive became 100x faster by moving beyond MapReduce, vectorizing execution, and introducing a cost-based optimizer.
They detailed and discussed the challenges of scalable SQL on Hadoop.
They looked into Hive's sub-second future, powered by LLAP and Hive on Spark.
And showed just how fast Hive on Spark really is.
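As a rough illustration of the knobs behind those speedups, a client session can switch on vectorized execution and the cost-based optimizer. The HiveServer2 endpoint below is hypothetical, while hive.vectorized.execution.enabled and hive.cbo.enable are standard Hive settings:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HivePerfFlags {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection c = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default");
                 Statement s = c.createStatement()) {
                s.execute("SET hive.vectorized.execution.enabled=true"); // operate on batches of rows
                s.execute("SET hive.cbo.enable=true");                   // Calcite-based cost-based optimizer
            }
        }
    }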
Data is the fuel for the idea economy, and being data-driven is essential for businesses to be competitive. HPE works with all the Hadoop partners to deliver packaged solutions that help you become data-driven. Join us in this session and you'll hear about HPE's enterprise-grade Hadoop solution, which encompasses the following:
-Infrastructure – Two industrialized solutions optimized for Hadoop: a standard solution with co-located storage and compute, and an elastic solution that lets you scale storage and compute independently to enable data sharing and prevent Hadoop cluster sprawl.
-Software – A choice of all popular Hadoop distributions, Hadoop ecosystem components like Spark and more, and a comprehensive utility to manage your Hadoop cluster infrastructure.
-Services – HPE's data center experts have designed some of the largest Hadoop clusters in the world and can help you design the right Hadoop infrastructure to avoid performance issues and future-proof you against Hadoop cluster sprawl.
-Add-on solutions – Hadoop needs more to fill in the gaps. HPE partners with the right ecosystem partners to bring you solutions such as industrial-grade SQL on Hadoop with Vertica, data encryption with SecureData, the SAP ecosystem with SAP HANA Vora, multitenancy with BlueData, object storage with Scality, and more.
Realtime Detection of DDOS attacks using Apache Spark and MLLib - Ryan Bosshart
In this talk we will show how Hadoop ecosystem tools like Apache Kafka, Spark, and MLlib can be used in various real-time architectures, and how they can be used to perform real-time detection of a DDoS attack. We will explain some of the challenges in building real-time architectures, then walk through the DDoS detection example and a live demo. This talk is appropriate for anyone interested in security, IoT, Apache Kafka, Spark, or Hadoop.
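The talk's exact pipeline is not reproduced here, but a common pattern it alludes to is: train a clustering model on normal traffic offline, then flag live records that sit far from every learned cluster. A minimal scoring sketch with Spark MLlib's KMeans, with invented feature names and threshold:

    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class DdosScorer {
        private final KMeansModel model;  // trained offline on normal traffic windows
        private final double threshold;   // e.g. 99th percentile of training distances

        public DdosScorer(KMeansModel model, double threshold) {
            this.model = model;
            this.threshold = threshold;
        }

        // Flags a traffic window (requests/s, unique source IPs, mean payload bytes)
        // whose distance to the nearest "normal" cluster center is too large.
        public boolean isSuspicious(double reqPerSec, double uniqueIps, double meanPayload) {
            Vector v = Vectors.dense(reqPerSec, uniqueIps, meanPayload);
            Vector nearest = model.clusterCenters()[model.predict(v)];
            return Math.sqrt(Vectors.sqdist(v, nearest)) > threshold;
        }
    }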
Presenter Ryan Bosshart is a Systems Engineer at Cloudera and is the first three-time presenter at BigDataMadison!
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse - DataWorks Summit
As Apache Hadoop clusters become central to an organization's operations, organizations increasingly run clusters in more than one data center. Historically, this has been driven largely by business continuity planning or geo-localization requirements. It has also recently been gaining interest from a hybrid cloud perspective, where people augment their traditional on-prem setup with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.
The Apache Hive community has been working on new capabilities for efficient and fault-tolerant replication of data in the Hive warehouse. In this talk, we will discuss these new capabilities, how they work, what replication at Hive scale looks like, what challenges it poses, and what we have done to solve those issues. We will also cover what users need to be aware of to make replication optimal for their use case.
Speaker
Sankar Hariappan, Senior Software Engineer, Hortonworks
Apache Hadoop 3.0 is coming! As the next major release, it is attracting everyone's attention as it showcases several bleeding-edge technologies and significant features across all components of Apache Hadoop, including: erasure coding in HDFS, multiple standby NameNodes, YARN Timeline Service v2, JNI-based shuffle in MapReduce, Apache Slider integration and first-class support for long-running services, Hadoop library updates, client-side classpath isolation, and more.
In this talk, we will give an update on the status of Hadoop 3, especially the release work in the community, and then dive deep into the new features included in Hadoop 3.0. As a new major release, Hadoop 3 also includes some incompatible changes; we will go through most of these changes and explore their impact on existing Hadoop users and operators. In the last part of the session, we will discuss ongoing efforts in the Hadoop 3 era and paint the big picture of how the big data landscape could be influenced by Hadoop 3.
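To ground one of the features above: in Hadoop 3, erasure coding is applied per directory, trading the 3x overhead of replication for roughly 1.5x with RS(6,3). A minimal sketch against the DistributedFileSystem API (the path and policy choice are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ErasureCodingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // assumes fs.defaultFS points at an HDFS 3.x cluster
            DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
            // New files under /cold are striped as 6 data + 3 parity blocks.
            dfs.setErasureCodingPolicy(new Path("/cold"), "RS-6-3-1024k");
        }
    }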
HDFS Tiered Storage: Mounting Object Stores in HDFS - DataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as in Azure HDInsight and Amazon EMR. In these settings, but also in more traditional on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS that works with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and enables hybrid cloud architectures.
This idea was presented at last year's Summit in San Jose. Much progress has been made since then, and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design and implementation and show how end users and admins will be able to use this powerful functionality.
Getting Additional Value from Logs and APM Data with AppDynamics Unified Analytics - AppDynamics
Join this session to hear the details about AppDynamics Unified Analytics, including the latest features and architecture. Gain the information you need to understand how to use your data effectively to improve your software, operations, and business performance, whether you're in DevOps, IT ops, application support, engineering, or product management. Deep dive into architecture and technology and how the product scales.
Key takeaways:
o New features and key technological advances such as advanced searches, smart insight, streaming, and centralized log configuration management
For more information, go to: www.appdynamics.com
Analytics for large-scale time series and event data - Anodot
Time series and event data form the basis for real-time insights about the performance of businesses such as ecommerce, the IoT, and web services, but gaining these insights involves designing a learning system that scales to millions and billions of data streams. In this presentation, Ira Cohen, Anodot cofounder and chief data scientist, outlines such a system that performs real-time machine learning and analytics on streams at massive scale.
Disrupt the static nature of BI with Predictive Anomaly Detection - Anodot
The static nature of BI today results in business insight latency that costs companies millions of dollars. Data-centric companies such as web-based businesses, digital advertising, fintech, and IoT need real-time business incident detection to optimize their business performance. In this presentation, Nir Kalish, Sr. Director of Solution Engineering, explains how this can be achieved using predictive anomaly detection. Presented at ODSC West, November 2016.
Logisland is an open-source event-mining platform based on Kafka and Spark, built to handle huge amounts of event and temporal data in order to find patterns and detect correlations. It is useful for log mining in security, fraud detection, IoT, and performance and system supervision.
This is the talk I gave at the Seattle Spark Meetup in March 2015. I discussed some Spark Streaming fundamentals and integration points with Kafka, Flume, etc.
Cisco's Open Device Programmability Strategy: Open Discussion - Cisco DevNet
Cisco DNA is an open and extensible, software-driven architecture built on a set of design principles with the objective of providing:
- Insights & Actions to drive faster business innovation
- Automation & Assurance to lower IT costs and complexity while meeting business and user expectations
- Security & Compliance to reduce risk as the organization continues to expand and grow. The architecture extends to Cisco network elements.
This session will focus on the open, model-driven, programmable interfaces available across Cisco's network elements which enable you to leverage and extend your network through applications that directly access the routers and switches in your network.
Watch the DevNet 1028 replay from the Cisco Live On-Demand Library at: https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=91041&backBtn=true
Check out more and register for Cisco DevNet: http://ow.ly/jCNV3030OfS
Realtime Analytical Query Processing and Predictive Model Building on High Di... - Spark Summit
Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms on row-based columnar datasets through full scans, but they do not provide constructs for column indexing or time series analysis. For document datasets with timestamps, where features are represented as a variable number of columns in each document and use cases demand searching over columns and time to retrieve documents and generate learning models in real time, a close integration between Spark and Lucene was needed. We introduced LuceneDAO at Spark Summit Europe 2016 to build distributed Lucene shards from a DataFrame, but time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO that maintains timestamps with the document-term view for search and allows time filters. Lucene shards maintain the time-aware document-term view for search and a vector-space representation for machine learning pipelines. We use Spark as our distributed query processing engine, where each query is represented as a boolean combination over terms with filters on time. LuceneDAO is used to load the shards into Spark executors and power sub-second distributed document retrieval for the queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries, while our asynchronous API uses Kafka, Spark Streaming, and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable timestamp aggregate columns, and the latency of the APIs on a suite of queries generated from terms. Key takeaways from the talk will be a thorough understanding of how to make Lucene-powered, time-aware search a first-class citizen in Spark to build interactive analytical query processing and time series prediction algorithms.
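LuceneDAO itself is not shown here, but the "boolean combination over terms with filters on time" maps directly onto standard Lucene constructs. A sketch of the query each shard might evaluate, assuming the timestamp is indexed as a LongPoint field:

    import org.apache.lucene.document.LongPoint;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class TimeAwareQuery {
        // Boolean term query restricted to [fromMillis, toMillis].
        public static Query build(String term, long fromMillis, long toMillis) {
            return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("body", term)), BooleanClause.Occur.MUST)
                .add(LongPoint.newRangeQuery("timestamp", fromMillis, toMillis),
                     BooleanClause.Occur.FILTER)
                .build();
        }
    }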
Achieving Network Deployment Flexibility with Mirantis OpenStack - Eric Zhaohui Ji
This is the deck presented for Intel Network Builder.
No longer do we live in a world where you can build your networks around expensive, proprietary pieces of hardware and software. Technology moves so fast that you need to be able to keep up, and that means changing your network on demand. But how can you achieve that kind of flexibility while still maintaining the crucial aspects of performance and reliability?
In this webinar we'll look at the network agility provided by OpenStack, which enables you to gain all of the advantages of software defined networking and Network Functions Virtualization without having to compromise on basic requirements. We'll discuss:
•How Mirantis OpenStack enables enterprise and telecom networking
•The features your OpenStack distribution needs to enable NFV
•Using DPDK and SR-IOV to enhance Virtual Network Function performance
•Achieving a Highly Available OpenStack control plane with Multi-rack deployment
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Many organizations process various types of data in different formats, and most often this data is in free form. As the number of consumers of this data grows, it is imperative that this free-flowing data adhere to a schema: it gives data consumers an expectation about the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate with and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
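As a small, registry-independent illustration of the schema evolution being managed, plain Apache Avro can check that adding a field with a default keeps old data readable under the new schema (backward compatible):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;

    public class CompatCheck {
        public static void main(String[] args) {
            Schema v1 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
                "{\"name\":\"id\",\"type\":\"string\"}]}");
            // v2 adds a field with a default, so data written with v1 can still be read.
            Schema v2 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
                "{\"name\":\"id\",\"type\":\"string\"}," +
                "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}");
            System.out.println(SchemaCompatibility
                .checkReaderWriterCompatibility(v2, v1).getType()); // COMPATIBLE
        }
    }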
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up to date: for example, when recommending TV programs while they are being broadcast, the model should take into account users who are watching a program at that moment.
The promise of online recommendation systems is fast adaptation to changes, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
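To make the batch/online split concrete: the batch layer periodically refits latent factors from scratch, while the online layer applies a cheap gradient step per incoming rating event. A minimal, framework-free sketch of that online step (learning rate and regularization values are illustrative):

    // One online SGD step for matrix-factorization recommendation.
    public class OnlineMfStep {
        static void update(double[] user, double[] item, double rating,
                           double lr, double reg) {
            double pred = 0;
            for (int k = 0; k < user.length; k++) pred += user[k] * item[k];
            double err = rating - pred;
            for (int k = 0; k < user.length; k++) {
                double u = user[k], i = item[k];
                user[k] += lr * (err * i - reg * u); // nudge user factors toward the event
                item[k] += lr * (err * u - reg * i); // nudge item factors symmetrically
            }
        }
    }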
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used to detect anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines such as Apache Spark and Apache Flink. Key in this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current deep learning research largely neglects. As we will see in the demo, LSTM networks can learn very complex system behavior, in this case data coming from a physical model simulating bearing vibration. One drawback of deep learning is that normally a very large labeled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with deep learning: no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source; only open source components are used.
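The detection logic described above can be sketched independently of any particular network library: train a sequence model only on normal sensor data, then flag readings whose prediction error exceeds a threshold. SequencePredictor below is a hypothetical stand-in for a trained LSTM (for example, one built with DeepLearning4J):

    public class PredictionErrorDetector {
        interface SequencePredictor {           // hypothetical; wraps the trained model
            double predictNext(double[] window);
        }

        private final SequencePredictor model;  // trained only on normal vibration data
        private final double threshold;         // e.g. mean + k * stddev of training error

        PredictionErrorDetector(SequencePredictor model, double threshold) {
            this.model = model;
            this.threshold = threshold;
        }

        boolean isAnomaly(double[] window, double actualNext) {
            // A large prediction error means the signal no longer matches learned behavior.
            return Math.abs(model.predictNext(window) - actualNext) > threshold;
        }
    }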
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it all is a tedious job. Given the multiple levels of indirection, the rate of false positives relative to actual defects is higher, which is generally wasteful.
At Hortonworks, we have designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline feeding an ensemble ML pipeline, which in turn feeds a recommendation engine. The system identifies the root cause of test failures by correlating failing test cases with current and historical error records across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new tickets or reopen past ones, and compares run profiles with past runs.
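Mool's internals are not public in this abstract, but one simple building block such a correlation engine might use is token-overlap similarity between a fresh failure log and historical error records; a minimal sketch with deliberately naive tokenization:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class ErrorSimilarity {
        // Jaccard similarity between the token sets of two error messages.
        static double jaccard(String a, String b) {
            Set<String> ta = tokens(a), tb = tokens(b);
            Set<String> inter = new HashSet<>(ta);
            inter.retainAll(tb);
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        private static Set<String> tokens(String s) {
            return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
        }
    }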
Improving business performance is never easy! The Natixis Pack is like rugby: working together is key to scrum success. Our data journey would undoubtedly have been much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for big data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level on- and off-heap caching, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, through determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
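To ground one of the features mentioned, table snapshots are driven through the Admin API; a minimal sketch with hypothetical table and snapshot names (a snapshot records metadata and HFile references without copying data):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class SnapshotSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                admin.snapshot("orders-snap-20170601", TableName.valueOf("orders"));
                // Materialize the snapshot as a new table, e.g. for verification.
                admin.cloneSnapshot("orders-snap-20170601", TableName.valueOf("orders_restored"));
            }
        }
    }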
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses, both operationally and through their products and services, by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is "no" in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16x, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2x throughput increases for the Capacity Scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently supports only MySQL Cluster as a backend. Hops opens up new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. At the same time, sensor data coming from production processes can be used to gain deeper insights into optimization potential. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as the basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to meet compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance, and open-world analytics. Learn how Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform big data projects from challenging, hard-to-maintain hand-coded jobs into repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use case" or PoC phase, where data governance concerns such as backup and disaster recovery (BDR) are not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, and the management components, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
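For HDFS specifically, the built-in snapshot primitive discussed in the talk is exposed through the FileSystem API. A minimal sketch (path and snapshot name are illustrative; remember the caveat above that files open for write are not frozen by the snapshot):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSnapshotSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/warehouse/events");
            // One-time admin prerequisite: hdfs dfsadmin -allowSnapshot /warehouse/events
            Path snap = fs.createSnapshot(dir, "bdr-20170601");
            System.out.println("Snapshot created at " + snap); // .../.snapshot/bdr-20170601
        }
    }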
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also seen, many times, how developers implement front-end features just by following the standard rules of a framework, assume that this is enough to successfully launch the project, and then watch the project fail. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
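As a taste of the Java side of the ecosystem, a load flow on PowSyBl's bundled tutorial network might look roughly like the sketch below. This is an assumption-laden illustration, not webinar material: it presumes the powsybl-loadflow-api artifact plus a load-flow implementation (such as OpenLoadFlow) on the classpath.

    import com.powsybl.iidm.network.Network;
    import com.powsybl.iidm.network.test.EurostagTutorialExample1Factory;
    import com.powsybl.loadflow.LoadFlow;
    import com.powsybl.loadflow.LoadFlowResult;

    public class PowsyblSketch {
        public static void main(String[] args) {
            // Small tutorial network shipped with PowSyBl's test utilities.
            Network network = EurostagTutorialExample1Factory.create();
            LoadFlowResult result = LoadFlow.run(network); // default provider from the classpath
            System.out.println("Converged: " + result.isOk());
            network.getBusView().getBusStream()
                   .forEach(b -> System.out.println(b.getId() + " V=" + b.getV() + " kV"));
        }
    }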
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell us all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details of how to best design a sturdy architecture within ODC.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Solutions Apricot) - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
These are my and Rik Marselis's slides from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We ended with a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.