The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information of a vehicle such as the location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context to understand driving behaviors at the detailed level. However, the collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently. And the raw sensor and GPS data must be properly pre-processed and transformed to extract signal relevant to downstream processes. In addition, driving behavior often depends on the spatial context, and the analysis of telematics must be contextualized using spatial and real-time traffic data.
Our talk covers the promises and challenges of telematics data. We present a framework for large-scaled telematics data analysis using Apache big data tools (Hadoop, Hive, Spark, Kafka, etc). We discuss common techniques to load and transform telematics data. We then present how to use machine learning on telematics data to derive insights about driving safety.
Speakers
Yanwei Zhang, Senior Data Scientists II, Uber
Neil Parker, Senior Software Engineer, Uber
Practice of large Hadoop cluster in China MobileDataWorks Summit
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows:
1. About Ambari: we improve Ambari with features like supporting HDFS Federation and Ambari HA , improving its performance and enabling it to support up to 1600 nodes.
2. About HDFS: we build a large HDFS cluster with data up to 60PB, using federation, ViewFS, FairCallQueue. Our best practice of cluster operation and management will also be included.
3. About Flume: We use the reformed Flume to collect data as much as 200TB per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 4: Streaming is harder to learn than Batch Processing.
- Myth 3: You need to choose between latency and throughput.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information of a vehicle such as the location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context to understand driving behaviors at the detailed level. However, the collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently. And the raw sensor and GPS data must be properly pre-processed and transformed to extract signal relevant to downstream processes. In addition, driving behavior often depends on the spatial context, and the analysis of telematics must be contextualized using spatial and real-time traffic data.
Our talk covers the promises and challenges of telematics data. We present a framework for large-scaled telematics data analysis using Apache big data tools (Hadoop, Hive, Spark, Kafka, etc). We discuss common techniques to load and transform telematics data. We then present how to use machine learning on telematics data to derive insights about driving safety.
Speakers
Yanwei Zhang, Senior Data Scientists II, Uber
Neil Parker, Senior Software Engineer, Uber
Practice of large Hadoop cluster in China MobileDataWorks Summit
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows:
1. About Ambari: we improve Ambari with features like supporting HDFS Federation and Ambari HA , improving its performance and enabling it to support up to 1600 nodes.
2. About HDFS: we build a large HDFS cluster with data up to 60PB, using federation, ViewFS, FairCallQueue. Our best practice of cluster operation and management will also be included.
3. About Flume: We use the reformed Flume to collect data as much as 200TB per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 4: Streaming is harder to learn than Batch Processing.
- Myth 3: You need to choose between latency and throughput.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
How is it that one system can query terabytes of data, yet still provide interactive query support? This talk will discuss two of the underlying technologies that allow Apache Hive to support fast query response, both on-premise in HDFS and in cloud object stores such as S3 and WASB.
LLAP was introduced in Hive 2.6. It provides standing processes that securely cache Hive’s columnar data and can do query processing without ever needing to start tasks in Hadoop. We will cover LLAP’s architecture, intended uses cases, and performance numbers for both on-premise and in the cloud.
The second technology is the integration of Hive with Apache Druid. Druid excels at low-latency, interactive queries over streaming data. Its method of storing data makes it very well suited for OLAP style queries. We will cover how Hive can be integrated with Druid to support real-time streaming of data from Kafka and OLAP queries.
Speaker: Alan Gates, Co-Founder, Hortonworks
Apache Hadoop 3.0 is coming! As the next major release, it attracts everyone's attention as show case several bleeding-edge technologies and significant features across all components of Apache Hadoop, include: Erasure Coding in HDFS, Multiple Standby NameNodes, YARN Timeline Service v2, JNI-based shuffle in MapReduce, Apache Slider integration and Service Support as First Class Citizen, Hadoop library updates and client-side class path isolation, etc.
In this talk, we will update the status of Hadoop 3 especially the releasing work in community and then go deep diving on new features included in Hadoop 3.0. As a new major release, Hadoop 3 would also include some incompatible changes - we will go through most of these changes and explore its impact to existing Hadoop users and operators. In the last part of this session, we will continue to discuss ongoing efforts in Hadoop 3 age and show the big picture that how big data landscape could be largely influenced by Hadoop 3.
It’s 2017, and big data challenges are as real as they get. Our customers have petabytes of data living in elastic and scalable commodity storage systems such as Azure Data Lake Store and Azure Blob storage.
One of the central questions today is finding insights from data in these storage systems in an interactive manner, at a fraction of the cost.
Interactive Query leverages [Hive on LLAP] in Apache Hive 2.1, brings the interactivity to your complex data warehouse style queries on large datasets stored on commodity cloud storage.
In this session, you will learn how technologies such as Low Latency Analytical Processing [LLAP] and Hive 2.x are making it possible to analyze petabytes of data with sub second latency with common file formats such as csv, json etc. without converting to columnar file formats like ORC/Parquet. We will go deep into LLAP’s performance and architecture benefits and how it compares with Spark and Presto in Azure HDInsight. We also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI, and do interactive query over their data lake without moving data outside the data lake.
Speaker
Ashish Thapliyal, Principal Program Manager, Microsoft Corp
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, and Apache Ambari Zepellin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data and then issue some SQL queries.df
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speakers: Robert Hryniewicz
Achieving a 360-degree view of manufacturing via open source industrial data ...DataWorks Summit
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables effect quality and yield and which process variables predict equipment failure? Getting to those answers is providing forward thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
Speakers
Michael Ger, General Manager Manufacturing and Automotive, Hortonworks
Wade Salazar, Solutions Engineer, Hortonworks
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Databricks
As the development of semiconductor devices, manufacturing system leads to improve productivity and efficiency for wafer fabrication. Owing to such improvement, the number of wafers yielded from the fabrication process has been rapidly increasing. However, current software systems for semiconductor wafers are not aimed at processing large number of wafers. To resolve this issue, the BISTel (a world-class provider of manufacturing intelligence solutions and services for manufacturers) tries to build several products for big data such as Trace Analyzer (TA) and Map Analyzer (MA) using Apache Spark. TA is to analyze raw trace data from a manufacturing process. It captures details on all variable changes, big and small and give the traces' statistical summary (i.e.: min, max, slope, average, etc.). Several BISTel's customers, which are the top-tier semiconductor company in the world use the TA to analyze the massive raw trace data from their manufacturing process. Especially, TA is able to manage terabytes of data by applying Apache Spark's APIs. MA is an advanced pattern recognition tool that sorts wafer yield maps and automatically identify common yield loss patterns. Also, some semiconductor companies use MA to identify clustering patterns for more than 100,000 wafers, which can be considered as big data in the semiconductor area. This talk will introduce these two products which are developed based on the Apache Spark and present how to handle the large-scale semiconductor data in the aspects of software techniques.
Speakers: Seungchul Lee, Daeyoung Kim
How is it that one system can query terabytes of data, yet still provide interactive query support? This talk will discuss two of the underlying technologies that allow Apache Hive to support fast query response, both on-premise in HDFS and in cloud object stores such as S3 and WASB.
LLAP was introduced in Hive 2.6. It provides standing processes that securely cache Hive’s columnar data and can do query processing without ever needing to start tasks in Hadoop. We will cover LLAP’s architecture, intended uses cases, and performance numbers for both on-premise and in the cloud.
The second technology is the integration of Hive with Apache Druid. Druid excels at low-latency, interactive queries over streaming data. Its method of storing data makes it very well suited for OLAP style queries. We will cover how Hive can be integrated with Druid to support real-time streaming of data from Kafka and OLAP queries.
Speaker: Alan Gates, Co-Founder, Hortonworks
Apache Hadoop 3.0 is coming! As the next major release, it attracts everyone's attention as show case several bleeding-edge technologies and significant features across all components of Apache Hadoop, include: Erasure Coding in HDFS, Multiple Standby NameNodes, YARN Timeline Service v2, JNI-based shuffle in MapReduce, Apache Slider integration and Service Support as First Class Citizen, Hadoop library updates and client-side class path isolation, etc.
In this talk, we will update the status of Hadoop 3 especially the releasing work in community and then go deep diving on new features included in Hadoop 3.0. As a new major release, Hadoop 3 would also include some incompatible changes - we will go through most of these changes and explore its impact to existing Hadoop users and operators. In the last part of this session, we will continue to discuss ongoing efforts in Hadoop 3 age and show the big picture that how big data landscape could be largely influenced by Hadoop 3.
It’s 2017, and big data challenges are as real as they get. Our customers have petabytes of data living in elastic and scalable commodity storage systems such as Azure Data Lake Store and Azure Blob storage.
One of the central questions today is finding insights from data in these storage systems in an interactive manner, at a fraction of the cost.
Interactive Query leverages [Hive on LLAP] in Apache Hive 2.1, brings the interactivity to your complex data warehouse style queries on large datasets stored on commodity cloud storage.
In this session, you will learn how technologies such as Low Latency Analytical Processing [LLAP] and Hive 2.x are making it possible to analyze petabytes of data with sub second latency with common file formats such as csv, json etc. without converting to columnar file formats like ORC/Parquet. We will go deep into LLAP’s performance and architecture benefits and how it compares with Spark and Presto in Azure HDInsight. We also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI, and do interactive query over their data lake without moving data outside the data lake.
Speaker
Ashish Thapliyal, Principal Program Manager, Microsoft Corp
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, and Apache Ambari Zepellin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data and then issue some SQL queries.df
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speakers: Robert Hryniewicz
Achieving a 360-degree view of manufacturing via open source industrial data ...DataWorks Summit
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables effect quality and yield and which process variables predict equipment failure? Getting to those answers is providing forward thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
Speakers
Michael Ger, General Manager Manufacturing and Automotive, Hortonworks
Wade Salazar, Solutions Engineer, Hortonworks
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Databricks
As the development of semiconductor devices, manufacturing system leads to improve productivity and efficiency for wafer fabrication. Owing to such improvement, the number of wafers yielded from the fabrication process has been rapidly increasing. However, current software systems for semiconductor wafers are not aimed at processing large number of wafers. To resolve this issue, the BISTel (a world-class provider of manufacturing intelligence solutions and services for manufacturers) tries to build several products for big data such as Trace Analyzer (TA) and Map Analyzer (MA) using Apache Spark. TA is to analyze raw trace data from a manufacturing process. It captures details on all variable changes, big and small and give the traces' statistical summary (i.e.: min, max, slope, average, etc.). Several BISTel's customers, which are the top-tier semiconductor company in the world use the TA to analyze the massive raw trace data from their manufacturing process. Especially, TA is able to manage terabytes of data by applying Apache Spark's APIs. MA is an advanced pattern recognition tool that sorts wafer yield maps and automatically identify common yield loss patterns. Also, some semiconductor companies use MA to identify clustering patterns for more than 100,000 wafers, which can be considered as big data in the semiconductor area. This talk will introduce these two products which are developed based on the Apache Spark and present how to handle the large-scale semiconductor data in the aspects of software techniques.
Speakers: Seungchul Lee, Daeyoung Kim
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Open Source Innovations in the MapR Ecosystem Pack 2.0MapR Technologies
Over the summer, we introduced the MapR Ecosystem Pack (MEP) which is a natural evolution of our existing software update program that decouples open source ecosystem updates from core platform updates. MEP gives our customers quick access to the latest open source innovations while also ensuring cross-project compatibility in any given MEP version.
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop PlatformApache Apex
Internet of Things (IoT) devices are becoming more ubiquitous in consumer, business and industrial landscapes. They are being widely used in applications ranging from home automation to the industrial internet. They pose a unique challenge in terms of the volume of data they produce, and the velocity with which they produce it, and the variety of sources they need to handle. The challenge is to ingest and process this data at the speed at which it is being produced in a real-time and fault tolerant fashion. Apache Apex is an industrial grade, scalable and fault tolerant big data processing platform that runs natively on Hadoop. In this deck, you will see how Apex is being used in IoT applications and also see how the enterprise features such as dimensional analytics, real-time dashboards and monitoring play a key role.
Presented by Pramod Immaneni, Principal Architect at DataTorrent and PPMC member Apache Apex, on BrightTALK webinar on Apr 6th, 2016
Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.
Lessons Learned from Leveraging Real-Time Power Consumption Data with Apache Kudu
ApacheCon North America 2019
https://www.apachecon.com/acna19/s/#/scheduledEvent/1179
Masahiro Ito
OSS Solution Center
Hitachi, Ltd.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent it comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, programming model and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
In this talk, we will share the experiences of applying Cassandra with two real customers in China. In the first use case, we deployed Cassandra at Sany Group, a leading company of Machinery manufacturing, to manage the sensor data generated by construction machinery. By designing a specific schema and optimizing the write process, we successfully managed over 1.5 billion historical data records and achieved the online write throughput of 10k write operations per second with 5 servers. MapReduce is also used on Cassandra for valued-added services, e.g. operations management, machine failure prediction, and abnormal behavior mining. In the second use case, Cassandra is deployed in the China Meteorological Administration to manage the Meteorological data. We design a hybrid schema to support both slice query and time window based query efficiently. Also, we explored the optimized compaction and deletion strategy for meteorological data in this case.
Use Machine Learning to Get the Most out of Your Big Data ClustersDatabricks
Enterprises across all sectors have invested heavily in big data infrastructure (Hadoop, Impala, Spark, Kafka, etc.) to turn data into insights into business value. Clusters are getting bigger, more complex and employing more and more data scientists and engineers. As a result, it is increasingly challenging for Data Ops teams to operate and maintain these clusters to meet business requirements and performance SLAs. For instance, a single SQL query may fail or take a long time to complete for various reasons, such as SQL-level inefficiencies, data skew, missing and stale statistics, pool-level resource configurations, such that a resource-hogging query could impact the entire application stack on that cluster. A critical capability to scale application performance is to do cluster-wide tuning. Examples include: tune the default application configurations so that all applications would benefit from that change, tune the pool-level resource allocations, identify wide-impact issues like slow nodes and too many small files, and many others. Cluster-level tuning requires considering more factors, and has a risk of significantly worsening cluster performance; however, it is often done via trial and error with educated guesswork, if attempted at all. We employ machine learning and AI techniques to make cluster-level tuning easier, more data-driven, and more accurate. This talk will describe our methodology to learn from various sources of data such as the workload, the cluster and pool resources, metastore, etc., and provide recommendations for cluster defaults for application and pool resource configurations. We will also present a case study where a customer applied unravel tuning recommendations and achieved 114% increase in the number of applications running per day while using 47% fewer vCore-Hours and 15% fewer containers.
Speaker: Eric Chu
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
We describe an application of CEP using a microservice-based streaming architecture. We use Drools business rule engine to apply rules in real time to an event stream from IoT traffic sensor data.
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful.
At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.