During the rise of “big data,” the geospatial analytics landscape has grown and evolved. We are beyond just analyzing static maps. Geospatial data streams in from devices, sensors, infrastructure systems, and social media, and our applications and use cases must scale dynamically to meet the increased demand.
The cloud can provide cost-effective storage and the ephemeral burst capacity needed for fast, low-latency processing, all to monetize the immediate value of fresh geospatial data. Geospatial analytics requires optimized spatial data types and algorithms to distill data into knowledge. Such processing, especially under strict latency requirements, has always been a challenge.
We propose an open source big data stack for geospatial analytics in the cloud, based on Apache NiFi, Apache Spark, and LocationTech GeoMesa. GeoMesa is a geospatial framework deployed on a modern big data platform that provides a scalable, low-latency solution for indexing large volumes of historical data and generating live views and streaming geospatial analytics. CONSTANTIN STANCA, Solutions Engineer, Hortonworks and JAMES HUGHES, Mathematician, CCRi
Presentation and live demo performed at DataWorks 2018 Conference - San Jose: https://bit.ly/2xthAGD
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a... - DataWorks Summit
Hadoop Distributed File System (HDFS)-based architectures allow faster ingestion and processing of larger quantities of time-series data than is presently possible in current seismic, hydroacoustic, and infrasonic (SHI) analysis platforms. We have developed a data acquisition and signal analysis system using Hadoop, Accumulo, and NiFi. The data model allows individual waveform samples and their associated metadata to be stored in Accumulo. This is a significant departure from traditional storage practices, where continuous waveform segments are stored with their associated metadata as a single entity. Our design allows for rapid table scans of large data archives within Accumulo for locating, retrieving, and analyzing specific waveform segments directly. The scalability of Hadoop permits the system to accommodate the ingestion and analysis of new data as a sensor network grows. Our system is currently acquiring data from over 200 SHI sensors. Peak ingest rates are approaching 500k entries per second, while preserving constant sub-second access times to any range of entries. The average load produced by the data ingest process consumes less than 10 percent of available system resources. CHARLES HOUCHIN, Computer Scientist, Air Force Technical Applications Center (AFTAC) and JOHN HIGHCOCK, Systems Architect, Hortonworks
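The abstract's key idea is storing each waveform sample as its own Accumulo entry under a time-ordered key. As a rough illustration of why that enables fast range scans, here is a minimal Python sketch of a sortable row-key layout; the station/channel/epoch format is an assumption for illustration, not AFTAC's actual schema.

```python
from datetime import datetime, timezone

def sample_row_key(station: str, channel: str, ts: datetime) -> str:
    """Lexicographically sortable row key so a contiguous Accumulo range
    scan returns one channel's samples in time order (illustrative layout,
    not the talk's actual schema)."""
    epoch_ms = int(ts.timestamp() * 1000)
    # Zero-padding keeps string order identical to time order.
    return f"{station}/{channel}/{epoch_ms:013d}"

# A time-range query becomes a simple key range for the scanner.
start = sample_row_key("IS07", "BDF", datetime(2017, 3, 1, tzinfo=timezone.utc))
end   = sample_row_key("IS07", "BDF", datetime(2017, 3, 2, tzinfo=timezone.utc))
print(start, "->", end)
```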
ExxonMobil’s journey to unleash time-series data with open source technology - DataWorks Summit
At ExxonMobil, we’ve had hundreds of thousands of sensors collecting data at our refineries and chemical plants all over the world for decades. This data has always been critical to the operations and monitoring of our equipment and processes, but it is more valuable now than ever as we bring new life to our archived and current data at a global scale by leveraging a combination of tools in today’s big data ecosystem.
Regardless of the industry you are in, whether it be energy, manufacturing, transportation, telecom, healthcare, or financial services, only a fraction of the time-series data collected and stored in legacy systems is being utilized to its full potential to improve business performance.
In this session you’ll learn how to navigate some of the challenges associated with time-series data at a global scale, as well as pitfalls to avoid, illustrated by real-world use cases with an emphasis on the following:
•NiFi: collection and centralization of data from legacy systems scattered across the globe
•Spark: validation, aggregations, interpolations, and more using in-memory processing (see the sketch after this list)
•HBase/Hive: partitioning, file formats, and storage strategies relevant to time-series data
•Consumption APIs: empowering users to work with the analytical tool of their choice through rich APIs. KEVIN BROWN, Big Data Platform Engineer, ExxonMobil
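To make the Spark bullet concrete, here is a minimal PySpark sketch of the validate-then-aggregate pattern described above; the tag names, valid value range, and one-minute window are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-rollup").getOrCreate()

# Hypothetical readings: one row per (sensor tag, timestamp, value).
readings = spark.createDataFrame(
    [("plant1/T-101", "2018-06-01 00:00:03", 87.2),
     ("plant1/T-101", "2018-06-01 00:00:58", None),    # bad sample
     ("plant1/T-101", "2018-06-01 00:01:41", 88.0)],
    ["tag", "ts", "value"],
).withColumn("ts", F.to_timestamp("ts"))

# Validation: drop nulls and out-of-range values, then aggregate into
# one-minute buckets -- the rollup pattern the bullet above refers to.
clean = readings.where(F.col("value").isNotNull() &
                       F.col("value").between(-50.0, 500.0))
per_minute = (clean
              .groupBy("tag", F.window("ts", "1 minute").alias("w"))
              .agg(F.avg("value").alias("avg_value"),
                   F.count("*").alias("samples")))
per_minute.show(truncate=False)
```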
Bridging the gap: achieving fast data synchronization from SAP HANA by levera... - DataWorks Summit
American Water will share the success story of its production use case: leveraging Hadoop and streaming to ingest and supply de-normalized data from source transactional systems to end-user applications. It covers the end-to-end flow and the challenges faced.
The data is de-normalized into single subject views at the source to eliminate complex join logic during ingestion into the data lake. Within the views, only timestamps on highly volatile tables have been exposed to give visibility to updates and inserts that have occurred on a table. NiFi ingests the data with a custom processor and then stores it in ACID tables in Hive. The custom processor polls the timestamp columns and generates paginated queries that return only the delta.
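As a rough sketch of what such generated delta queries could look like, the following Python snippet builds paginated SQL from a watermark; the view name, column name, and paging style are hypothetical assumptions, and the real NiFi processor's SQL may differ.

```python
def delta_queries(view: str, ts_col: str, watermark: str,
                  pages: int = 3, page_size: int = 10000):
    """Yield paginated SQL pulling only rows changed since the watermark.
    A sketch of the queries the custom NiFi processor could generate;
    names and the LIMIT/OFFSET paging are illustrative assumptions."""
    for page in range(pages):
        yield (f"SELECT * FROM {view} "
               f"WHERE {ts_col} > '{watermark}' "
               f"ORDER BY {ts_col} "
               f"LIMIT {page_size} OFFSET {page * page_size}")

for q in delta_queries("CUSTOMER_VIEW", "CHANGED_AT", "2018-04-01 00:00:00"):
    print(q)
```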
American Water’s use case: our field employees are our front line with customers, and in the past they have felt unable to help customers effectively with older technologies. One of the largest initiatives is to equip field employees with accurate, up-to-date information via a new application so they can provide a great customer experience.
Speakers
John Kuchmek, American Water, Senior Technologist
Adam Michalsky, American Water, Senior Technologist
United Airlines is leveraging big data at the enterprise level to help drive revenue, improve the customer experience, optimize operations, and support our employees in their day-to-day activities. At the center of our big data stack is Apache Hadoop, supported by many other emerging open source frameworks that must be integrated with the myriad operational systems that support a 90-year-old transportation company with worldwide operations. In addition, learn how streaming data and streaming analytics are helping to drive operational decisions in real time, and how this is being architected to scale horizontally to take advantage of high availability and parallel processing. With the rapidly evolving Hadoop ecosystem and so many new open source technologies at our disposal, the options for solving long-standing industry problems such as modeling how customers make decisions, making timely and meaningful real-time offers, and optimizing logistical operations have never been better. JOE OLSON, Senior Manager, Big Data Analytics, United Airlines and JONATHAN INGALLS, Sr. Solutions Engineer, Hortonworks
Data Gloveboxes: A Philosophy of Data Science Data Security - DataWorks Summit
Data scientists often have access to very sensitive material: data! Today's data scientists need a way to interact with toxic data, where spilling even a little could be destructive to a company. Securing compute clusters like the nuclear gloveboxes of old is one technique to limit data exfiltration and ensure data production is regularized, reliable, and secure.
This talk will cover the philosophy and implementation of:
Data Dropbox: data goes in blindly but can be verified via checksums, and data directionality is enforced; HDFS is used as a model, and the state of HBase is discussed (see the checksum sketch after this list).
Data Glovebox: one can manipulate data as desired but cannot exfiltrate it except via very specific, controlled processes; the Oozie Git action is a step in this direction.
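A minimal sketch of the dropbox's checksum verification step, using Python's standard hashlib; the promote-on-match policy is an assumption about how such a dropbox could be wired, not the talk's exact implementation.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large drops verify in constant memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_drop(path: Path, expected_digest: str) -> bool:
    # The dropbox accepts files blindly; the producer publishes the digest
    # out of band, and only files whose digest matches are promoted.
    return sha256_of(path) == expected_digest
```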
Lessons learned processing 70 billion data points a day using the hybrid cloud - DataWorks Summit
NetApp receives 70 billion data points of telemetry each day from its customers’ storage systems. This telemetry contains configuration information, performance counters, and logs. All of this data is processed using multiple Hadoop clusters and feeds a machine learning pipeline and a data-serving infrastructure that produces insights for customers via an application called Active IQ. We describe the evolution of our Hadoop infrastructure from a traditional on-premises architecture to the hybrid cloud, and lessons learned.
We’ll discuss the insights we are able to produce for our customers, and the techniques used. Finally, we describe the data management challenges with our multi-petabyte Hadoop data lake. We solved these problems by building a unified data lake on-premises and using the NetApp Data Fabric to seamlessly connect to public clouds for data science and machine learning compute resources.
Architecting a truly hybrid cloud implementation allowed NetApp to free our data scientists to use any software on any cloud, kept customer log data safe on NetApp Private Storage in Equinix, enabled faster innovation and code releases, and provided the flexibility to use any public cloud while the data stays on NetApp storage in Equinix.
Speakers
Pranoop Erasani, NetApp, Senior Technical Director, ONTAP
Shankar Pasupathy, NetApp, Technical Director, ACE Engineering
MapR on Azure: Getting Value from Big Data in the Cloud - MapR Technologies
Public cloud adoption is exploding and big data technologies are rapidly becoming an important driver of this growth. According to Wikibon, big data public cloud revenue will grow from 4.4% in 2016 to 24% of all big data spend by 2026. Digital transformation initiatives are now a priority for most organizations, with data and advanced analytics at the heart of enabling this change. This is key to driving competitive advantage in every industry.
There is nothing better than a real-world customer use case to help you understand how to get value from big data in the cloud and apply the learnings to your business. Join Microsoft, MapR, and Sullexis on November 10th to:
Hear from Sullexis on the business use case and technical implementation details of one of their oil & gas customers
Understand the integration points of the MapR Platform with other Azure services and why they matter
Know how to deploy the MapR Platform on the Azure cloud and get started easily
You will also get to hear about customer use cases of the MapR Converged Data Platform on Azure in other verticals such as real estate and retail.
Speakers
Rafael Godinho
Technical Evangelist
Microsoft Azure
Tim Morgan
Managing Director
Sullexis
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results - DataWorks Summit
Apache Spark is increasingly adopted as an alternative processing framework to MapReduce, due to its ability to speed up batch, interactive, and streaming analytics. Spark enables new analytics use cases like machine learning and graph analysis with its rich and easy-to-use programming libraries. And it offers the flexibility to run analytics on data stored in Hadoop, across object stores, and within traditional databases. This makes Spark an ideal platform for accelerating cross-platform analytics on-premises and in the cloud. Building on the success of the Spark 1.x releases, Spark 2.x delivers major improvements in the areas of API, performance, and Structured Streaming. In this paper, we will cover a high-level view of the Apache Spark framework, and then focus on what we consider to be very important improvements made in Apache Spark 2.x. We will then share the results of a real-world benchmark effort, detail the Spark and environment configuration changes made to our lab, discuss the results of the benchmark, and provide a reference architecture example for those interested in taking Spark 2.x for their own test drive. This presentation stresses the value of refreshing Spark 1 deployments with Spark 2, as performance testing shows a 2.3x improvement with SparkSQL workloads similar to TPC Benchmark™ DS (TPC-DS). MARK LOCHBIHLER, Principal Architect, Hortonworks and VIPLAVA MADASU, Big Data Systems Engineer, Hewlett Packard Enterprise
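As a hedged illustration of the kind of Spark 2.x SQL knobs such a benchmark exercises, here is a small PySpark configuration and TPC-DS-style query sketch; the setting values and tiny stand-in tables are placeholders, not the talk's tuned configuration or results.

```python
from pyspark.sql import SparkSession

# Illustrative Spark 2.x SQL knobs only -- the talk reports its own tuned
# values and results; none of these numbers come from the benchmark.
spark = (SparkSession.builder
         .appName("tpcds-like-run")
         .config("spark.sql.shuffle.partitions", "400")
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Tiny stand-in tables so the TPC-DS-style join below actually runs.
spark.range(5).selectExpr(
    "id AS ss_sold_date_sk", "id * 10.0 AS ss_net_paid"
).createOrReplaceTempView("store_sales")
spark.range(5).selectExpr(
    "id AS d_date_sk", "cast(1998 + id AS int) AS d_year"
).createOrReplaceTempView("date_dim")

# The benchmark workloads are plain SparkSQL queries in this shape.
spark.sql("SELECT d_year, SUM(ss_net_paid) AS revenue "
          "FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk "
          "GROUP BY d_year").show()
```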
We’re living in an era of digital disruption, where the accessibility and adoption of emerging digital technologies are enabling enterprises to reimagine their businesses in exciting new ways. Data flows from the edge to the core to the cloud while performing analytics and gaining actionable intelligence at all steps along the way. This connected, automated, and data-driven future enables organizations to rapidly acquire, analyze, and take action on real-time data as well as curate flows for additional analysis at a later stage. New IoT use cases require enterprises to properly handle data in motion and create newer edge applications with data flow management, stream processing, and analytics while still being governed by existing enterprise services.
This session highlights the importance of an edge-to-core-to-cloud digital infrastructure that can adapt to your flexing business needs, capturing expanding data flows at the edge and aligning them to a core infrastructure that can drive insight.
Speakers
Bob Mumford, Hewlett Packard Enterprise, Big Data Solutions Architect
GDPR compliance application architecture and implementation using Hadoop and ... - DataWorks Summit
The General Data Protection Regulation (GDPR) is legislation designed to protect the personal data of European Union citizens and residents. The main requirement is to log personal data accesses/changes in customer-specific applications. These logs can then be audited by the owning entities to provide reporting to end users indicating usage of their personal data. Users have the "right to be forgotten," meaning their personal data can be purged from the system at their request. The regulation goes into effect on May 25, 2018, with significant fines for non-compliance.
This session will provide insight on how to approach and implement a GDPR compliance solution using Hadoop and streaming for any enterprise with heavy volumes of data. It will delve into deployment strategies, the architecture of choice (Kafka, NiFi, and Hive ACID with streaming), implementation best practices, configurations, and security requirements. Hortonworks Professional Services System Architects helped the customer on the ground to design, implement, and deploy this application in production.
Speakers
Saurabh Mishra, Hortonworks, Systems Architect
Arun Thangamani, Hortonworks, Systems Architect
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption - DataWorks Summit
Fine-grained data protection at the column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads with integrated support for finding required rows quickly. In this talk, we will outline the progress made in the Apache community toward adding fine-grained column-level encryption natively into the ORC format, which will also provide capabilities to mask or redact data on write while protecting sensitive column metadata, such as statistics, to avoid information leakage. The column encryption capabilities will be fully compatible with the Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally. An end-to-end scenario showing how this capability can be leveraged will also be demonstrated.
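A sketch of what per-column ORC encryption could look like at the table-definition level, shown as DDL built in Python. The orc.encrypt and orc.mask property names follow the Apache ORC column-encryption design this talk covers, but exact syntax and availability depend on ORC/Hive versions, so treat this as illustrative.

```python
# Hedged sketch: a Hive table definition using the per-column encryption
# properties from the Apache ORC design discussed in the talk. Property
# names and syntax are assumptions to be checked against your versions.
# "pii" is a master-key name managed by the Hadoop KMS.
ddl = """
CREATE TABLE customers (
  name  STRING,
  ssn   STRING,
  email STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.encrypt' = 'pii:ssn,email',
  'orc.mask'    = 'nullify:ssn'
)
"""
# 'orc.encrypt' encrypts ssn and email under the "pii" key; 'orc.mask'
# controls what readers without key access see (here, NULLs for ssn).
print(ddl)
```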
A Study Review of Common Big Data Architecture for Small-Medium Enterprise - Ridwan Fadjar
This slide deck was created to present the results of my paper, "A Study Review of Common Big Data Architecture for Small-Medium Enterprise," at MSCEIS, FPMIPA Universitas Pendidikan Indonesia, 2019.
In cooperation with: https://www.linkedin.com/in/faijinali and https://www.linkedin.com/in/fajriabdillah
Kudu as Storage Layer to Digitize Credit Processes - DataWorks Summit
With HDFS and HBase, there are two different storage options available in the Hadoop ecosystem. Both have their strengths and weaknesses. However, neither HDFS nor HBase can be used universally for all kinds of workloads. Usually this leads to complex hybrid architectures. Kudu is a very versatile storage layer which fills this gap and simplifies the architecture of Big Data systems.
A large German bank is using Kudu as a storage layer to accelerate its credit processes. Within this system, financial transactions of millions of customers are analyzed by Spark jobs to categorize transactions and to calculate key figures. In addition to this analytical workload, several frontend applications use the Kudu Java API to perform random reads and writes in real time.
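For readers unfamiliar with Kudu's random-access API, here is a hedged sketch of the read/write pattern the abstract describes, using Kudu's Python client rather than the Java API the bank's applications use; the table and column names are invented, and the snippet assumes a reachable cluster that already has this table.

```python
# Hedged sketch of Kudu random reads/writes via the kudu-python client.
# "transactions", "txn_id", and "category" are illustrative names only.
import kudu

client = kudu.connect(host="kudu-master", port=7051)
table = client.table("transactions")
session = client.new_session()

# Random write: upsert one categorized transaction by primary key.
session.apply(table.new_upsert({"txn_id": 42, "category": "groceries"}))
session.flush()

# Random read: scan just that key back with a predicate.
scanner = table.scanner()
scanner.add_predicate(table["txn_id"] == 42)
print(scanner.open().read_all_tuples())
```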
The presentation will cover these topics:
- Business and technical requirements
- Data access patterns
- System architecture
- Kudu data modelling
- Kudu architecture for High Availability
- Experiences from development and operations
Speaker: Olaf Hein, Department Head & Principal Consultant
ORDIX AG
"Who Moved my Data? - Why tracking changes and sources of data is critical to...Cask Data
Speaker: Russ Savage, from Cask
Big Data Applications Meetup, 09/14/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://youtu.be/4j78g3WvC4Y
About the talk:
As data lake sizes grow, and more users begin exploring and including that data in their everyday analysis, keeping track of the sources for data becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis is leveraging the most accurate and up to date information. In this talk, we will explore the different techniques available to keep track of your data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.
Supporting Hadoop in containers takes much more than the very primitive support Docker provides via its storage plugin. A production-scale Hadoop deployment inside containers needs to honor affinity/anti-affinity, fault-domain, and data-locality policies. Kubernetes alone, with primitives such as StatefulSets and PersistentVolumeClaims, is not sufficient to support a complex, data-heavy application such as Hadoop. One needs to think about this problem more holistically across the container, networking, and storage stacks. Also, constructs around deployment, scaling, upgrades, etc. in traditional orchestration platforms are designed for applications that have adopted a microservices philosophy, which doesn't fit most big data applications across the ingest, store, process, serve, and visualization stages of the pipeline. Come to this technical session to learn how to run and manage the lifecycle of containerized Hadoop and other applications in the data analytics pipeline efficiently and effectively, far beyond simple container orchestration. #BigData, #NoSQL, #Hortonworks, #Cloudera, #Kafka, #Tensorflow, #Cassandra, #MongoDB, #Kudu, #Hive, #HBase. PARTHA SEETALA, CTO, Robin Systems.
The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information about a vehicle, such as its location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context for understanding driving behaviors at a detailed level. However, the collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently, and the raw sensor and GPS data must be properly pre-processed and transformed to extract the signal relevant to downstream processes. In addition, driving behavior often depends on the spatial context, and the analysis of telematics must be contextualized using spatial and real-time traffic data.
Our talk covers the promises and challenges of telematics data. We present a framework for large-scale telematics data analysis using Apache big data tools (Hadoop, Hive, Spark, Kafka, etc.). We discuss common techniques to load and transform telematics data, then present how to use machine learning on telematics data to derive insights about driving safety.
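As one concrete example of turning a raw GPS trace into a driving-safety signal, here is a small PySpark sketch that flags harsh braking from speed deltas; the trace values and the 4 m/s^2 threshold are illustrative assumptions, not the framework from the talk.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("telematics-demo").getOrCreate()

# Hypothetical GPS trace: (trip_id, second offset, speed in m/s).
trace = spark.createDataFrame(
    [("t1", 0, 20.0), ("t1", 1, 19.5), ("t1", 2, 12.0), ("t1", 3, 11.8)],
    ["trip_id", "t", "speed"])

w = Window.partitionBy("trip_id").orderBy("t")
events = (trace
          .withColumn("accel", F.col("speed") - F.lag("speed").over(w))
          # Flag decelerations sharper than 4 m/s^2 as harsh braking.
          .withColumn("harsh_brake", F.col("accel") < -4.0))
events.show()
```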
Speakers
Yanwei Zhang, Senior Data Scientist II, Uber
Neil Parker, Senior Software Engineer, Uber
Improving business performance is never easy! The Natixis Pack is like rugby: working together is key to scrum success. Our data journey would undoubtedly have been much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for big data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security, and project management, as well as the industrial usage of Hive, HBase, Kafka, and Spark within a Corporate and Investment Bank environment, framed by regulatory constraints.
SkyWatch is all about making Earth-observation data digestible and accessible. They believe that creating a single place to bring together the planet’s observational datasets will make new waves in geospatial analytics. In this session, we'll take a look at how companies can take advantage of cloud-native workflows to enable access and analysis across planetary-scale datasets. You’ll hear how SkyWatch leveraged AWS serverless technologies to build a company that transforms petabytes of sensor data from space into useful information. You'll also learn how Sinergise is merging a variety of data streams through products like Sentinel Hub, and creating actionable intelligence for its users.
Operationalizing Machine Learning Using GPU-accelerated, In-database Analytics - Kinetica
Mate Radalj's presentation on how to operationalize machine learning using GPU-accelerated, in-database analytics, given at the Bay Area GPU-Accelerated Computing Meetup on October 19, 2017. Presentation includes use cases and links to demos.
Data systems in NASA's Earth Science Division are primarily focused on providing stewardship of the products of remote sensing and are manifested as digital active archive systems. Each instrument team has a related science team that defines the algorithms and monitors the processing of the instruments' output to produce the related data products and to ensure their format and standards compliance. These teams are also influenced by the research and applied-sciences components of the programs, but the primary focus is on proving the ongoing validity of the products. Across the distributed system, every product is different. However, this is not conducive to analytics. NASA's Advanced Information Systems Technology (AIST) program is developing an entirely new approach to creating Analytic Centers, which focus on the scientific investigation and harmonize the data, computing resources, and tools to enable and accelerate scientific discovery. Stay tuned to find out how. A major element of today's science interests is the comparison of multi-dimensional datasets; this warrants considerable experimentation in trying to understand how to do so meaningfully and quantitatively. Asked another way, "What do you mean by similar?" Uncertainty quantification has evolved considerably in the arenas of data reduction and full-physics models; however, the emerging demand for machine learning and other artificial intelligence techniques has failed to keep uncertainty quantification and error propagation in mind, and there is considerable work to be done.
Big data analytics and machine intelligence v5.0 - Amr Kamel Deklel
Why big data
What is big data
When big data is big data
Big data information system layers
Hadoop ecosystem
What is machine learning
Why machine learning with big data
This presentation covers architectural principles for software-defined "everything," microservices and their impact on Azure, a geospatial fleet analysis using Spark and HDFS on ESRI, and flow-based programming.
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P... - Accumulo Summit
LocationTech GeoMesa is a project that builds on open-source, distributed databases like Accumulo, HBase, and Cassandra to scale up indexing, querying, and analyzing billions of spatio-temporal data points. GeoMesa uses space-filling curves to index multi-dimensional data in Accumulo, and we'll discuss recent improvements for non-point geometries. Over the two and a half years GeoMesa has been an open-source project, GeoMesa's Accumulo schemas have evolved and our team has had a chance to work through creating and optimizing custom Accumulo iterators. These custom iterators allow for better query performance and interesting aggregations. GeoMesa provides support for distributed processing in Spark via MapReduce input and output formats that extend their Accumulo counterparts. We will discuss the performance benefit gained by reducing the number of default map/Spark tasks created for complex query patterns. The talk will conclude with updates about GeoMesa's integration with Jupyter notebook and improvements to GeoMesa's Spark integration.
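A toy Python version of the space-filling-curve idea: interleaving the bits of grid coordinates yields a Z-order (Morton) key, so spatially close points get lexicographically close keys. GeoMesa's production curves also fold in time and use finer resolutions; this sketch shows only the core trick.

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two grid coordinates into one Z-order
    (Morton) value -- the space-filling-curve idea GeoMesa builds on."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_key(lon: float, lat: float, bits: int = 16) -> int:
    # Scale lon/lat onto a 2^bits x 2^bits grid, then interleave.
    gx = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    gy = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    return interleave(gx, gy, bits)

# Nearby points get nearby keys, so a bounding box becomes a small set
# of contiguous key ranges that Accumulo can scan efficiently.
print(z_key(-78.47, 38.03))   # Charlottesville
print(z_key(-78.50, 38.00))   # close by -> close key
```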
– Speaker –
Dr. James Hughes
Mathematician, Commonwealth Computer Research, Inc (CCRi)
Dr. James Hughes is a mathematician at Commonwealth Computer Research, Inc. in Charlottesville, Virginia. He is a core committer for GeoMesa which leverages Accumulo and other distributed database systems to provide distributed computation and query engines. He is a LocationTech committer for GeoMesa, SFCurve, and GeoBench. He serves on the LocationTech Project Management Committee and Steering Committee. Through work with LocationTech and OSGeo projects like GeoTools and GeoServer, he works to build end-to-end solutions for big spatio-temporal problems. He holds a PhD in algebraic topology from the University of Virginia.
— More Information —
For more information see http://www.accumulosummit.com/
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. The labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
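A minimal version of the workshop's train-and-evaluate loop, using the same scikit-learn library on one of its bundled datasets; the model choice and split are arbitrary examples of the kind of steps the lab walks through.

```python
# Plain scikit-learn, runnable anywhere (including CDSW): train a
# classifier on a bundled dataset and report held-out accuracy.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```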
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" that can be embedded inside HBase and provides the level of durability HBase requires for WALs. Apache Ratis (incubating) is a Java library implementation of the RAFT consensus protocol and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it with other log-based systems that exist today. Next, we'll cover how the Log Service fits into HBase and the necessary changes to HBase that enable it. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
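To make the durability argument concrete, here is a toy Python model of the RAFT rule the Log Service relies on: an appended WAL entry counts as durable once a majority of replicas have persisted it. This illustrates the protocol's acknowledgement logic only; it is not Ratis's actual API.

```python
class Replica:
    """Toy replica that 'persists' an entry; a real one would fsync to disk."""
    def __init__(self):
        self.log = []

    def persist(self, entry: bytes) -> bool:
        self.log.append(entry)
        return True

def append_durably(entry: bytes, replicas: list) -> bool:
    # RAFT's durability rule: an entry is committed once a majority of
    # replicas have persisted it. Illustration only -- not Ratis's API.
    acks = sum(1 for r in replicas if r.persist(entry))
    return acks >= len(replicas) // 2 + 1

replicas = [Replica() for _ in range(3)]
print(append_durably(b"wal-edit-1", replicas))   # True: 3 acks >= majority of 2
```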
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data with Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables are also a great option, since we can easily put microservices on top of them for application use. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
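Besides Spring Boot, the same Phoenix tables can be queried from Python through the Phoenix Query Server using the phoenixdb driver. A hedged sketch follows; the host, table, and column names are invented for illustration, not the talk's actual schema.

```python
# Hedged sketch: querying a Phoenix table through the Phoenix Query Server
# with the phoenixdb driver. PHILLY_CRIME and its columns are made up; the
# talk's own service does the equivalent from a Spring Boot app in Java.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()
cursor.execute(
    "SELECT dc_dist, text_general_code, dispatch_date "
    "FROM PHILLY_CRIME WHERE dispatch_date = ? LIMIT 10",
    ["2016-09-01"])
for row in cursor.fetchall():
    print(row)
```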
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to big data, it is not trivial to design applications that make the most of it, nor the simplest system to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organizations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges OCLC has encountered scaling to support the world catalog, and how they have overcome them.
Many individuals and organizations have a desire to utilize NoSQL technology but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in a drastically increased desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
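As a preview of the table-design idea, here is a small Python sketch in the spirit of the dirlist example: a zero-padded depth prefix on the row key keeps all entries of one directory contiguous, so a directory listing becomes a single range scan. This simplifies the real example, which also stores file metadata and an index for name search.

```python
import bisect

def dir_row(path: str) -> str:
    """Row key in the style of Accumulo's dirlist example: a zero-padded
    depth prefix groups each directory's children into one contiguous
    key range (simplified from the real example's schema)."""
    depth = path.rstrip("/").count("/")
    return f"{depth:03d}{path}"

rows = sorted(dir_row(p) for p in
              ["/", "/data", "/data/raw", "/data/clean", "/home", "/home/alice"])

# "List /data" = scan depth-2 rows whose path starts with /data/.
lo, hi = "002/data/", "002/data0"   # '0' sorts just after '/'
start, end = bisect.bisect_left(rows, lo), bisect.bisect_left(rows, hi)
print(rows[start:end])              # ['002/data/clean', '002/data/raw']
```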
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of newer data. Organizing data for efficient reading involves factoring in query patterns to partition data so that read amplification stays low. Organizing data for efficient writing involves factoring in the nature of the input data - whether it is append-only or updatable.
At Uber we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, leading to duplicated data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems. It needs strong consistency and high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component, which is critical to scaling our jobs to that write volume. In this talk, we will discuss data at Uber, expound on why we built the global index on Apache HBase, and explain how it helps scale out our cluster usage. We'll detail why we chose HBase over other storage systems, how and why we devised a creative solution that loads HFiles directly into the backend (circumventing the normal write path) when bootstrapping our ingestion tables to avoid QPS constraints, and other lessons learned bringing this system into production at the scale of data Uber encounters daily.
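To make the indexing idea concrete, here is a minimal sketch (not Uber's actual implementation) of a record-key-to-file-location lookup backed by an HBase table: a hit means the incoming change is an update routed to the file that already holds the row, while a miss means an insert. The table, column, and key names are assumptions.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object GlobalIndexSketch {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val index = conn.getTable(TableName.valueOf("trips_index")) // hypothetical table
    try {
      val key = Bytes.toBytes("trip-12345")
      val result = index.get(new Get(key))
      if (result.isEmpty) {
        // Miss: an insert. Record where the new row will land in HDFS.
        val put = new Put(key)
        put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("file"),
          Bytes.toBytes("hdfs://warehouse/trips/part-000123.parquet"))
        index.put(put)
      } else {
        // Hit: an update. Route the change to the file that already holds the row.
        val file = Bytes.toString(
          result.getValue(Bytes.toBytes("loc"), Bytes.toBytes("file")))
        println(s"update goes to $file")
      }
    } finally { index.close(); conn.close() }
  }
}
```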
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. Omid, in turn, has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor considering the variety of data sources that need to be collected and analyzed: everything from application logs, network events, authentication systems, IoT devices, business events, and cloud service logs. In addition, multiple data formats need to be transformed and conformed so they can be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced cost-based optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, discuss the best use cases for Presto across several industries, and present recent Presto advancements such as geospatial analytics at scale along with the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the party doing the training run makes no special effort to record them. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project, and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
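To give a sense of scale for "a few lines of code," here is a hedged sketch using MLflow's Java client from Scala (the Python API is the more common choice); the tracking URI, experiment name, and logged values are placeholders, not a recommended setup.

```scala
import java.io.File
import org.mlflow.api.proto.Service.RunStatus
import org.mlflow.tracking.MlflowClient

object TrainWithTracking {
  def main(args: Array[String]): Unit = {
    // Hypothetical tracking server; MLflow also supports local file-based tracking.
    val client = new MlflowClient("http://localhost:5000")
    val experimentId = client.createExperiment("vessel-classifier") // illustrative name
    val runId = client.createRun(experimentId).getRunId

    client.logParam(runId, "learning_rate", "0.01")       // a hyperparameter
    client.logMetric(runId, "accuracy", 0.94)             // an evaluation result
    client.logArtifact(runId, new File("plots/loss.png")) // any file artifact

    client.setTerminated(runId, RunStatus.FINISHED)
  }
}
```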
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users run both batch and real-time analytics. Our data platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our data platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive into deeply in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premises and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and to de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail and its implications for the broader Consumer Goods industry, and share the business drivers, use cases, and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into advanced image processing, describing possible ways a retail store of the near future could operate: a deep learning system attached to a camera stream can identify various storefront situations, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the full inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having on the Consumer Goods industry, along with the key use cases, techniques, and considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Securing your Kubernetes cluster: a step-by-step guide to success!KatiaHIMEUR1
Today, after several years of existence, backed by an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard for container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy,” how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our beloved cloud-native principles to it? What benefits could the two technologies bring to each other?
Let me take these questions and lead you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud or on-premises strategy we may need in order to apply AI to our own infrastructure and make it work from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
1. High Performance and Scalable Geospatial Analytics on Cloud with Open Source
James Hughes – CCRI
Constantin Stanca – Hortonworks
2. Summary
• Loading geospatial data into the cloud and GeoTools datastores never seems as easy as it should be. There are sensor networks, GPS devices, Twitter streams, FTP servers, and all sorts of other data that you need to parse, convert to SimpleFeatures, and then ingest.
• GeoMesa, NiFi, and Spark provide a fully open source solution to ease the pain of ingesting and analyzing data using ANY GeoTools data store.
• DataPlane Services Cloud Manager (powered by Cloudbreak) helps you deploy ephemeral geospatial analytics clusters to support increased computation requirements, all decoupled from storage.
• We will show how real-time streaming data such as satellite AIS can be ingested and managed in real time with NiFi, and how geospatial data stored in S3, HDFS, or HBase, in ORC or Parquet, can be queried at scale using GeoMesa, Spark, and Zeppelin.
4. Data Movement & System Complexity with Added Pressure of Big Data
[Diagram: many separate Acquire Data → Store Data pipelines, stitched together by a data flow into a single Process and Analyze Data stage.]
5. If That Was Not Enough … Spatial Data Types
• Points: locations, events, instantaneous positions
• Lines: road networks, voyages, trips, trajectories
• Polygons: administrative regions, airspaces
6. If That Was Not Enough … Spatial Data Relationships
equals, disjoint, intersects, touches, crosses, within, contains, overlaps
7. If That Was Not Enough … Topology Operations
Algorithms: convex hull, buffer, validation, dissolve, polygonization, simplification, triangulation, Voronoi, linear referencing, and more...
9. Traditional Approach
• GIS, data crunching, and web serving were three very separate worlds.
• If a web app wanted access to the analysis, there was a long process of ETL, DB work, imports and exports, and bribing various network and storage people for the resources you needed.
10. Requirements for a High Performance Geospatial Analytics Platform
• IoT sensors present an opportunity to understand the world right now
• A map of the current state of the world enables faster reactions
• The variety of sensors and data sources presents data management challenges
• Adding new, varied data sources must be easy
• Big data requires distributed storage/computation and scalable infrastructure
• The data layer has to scale
• Analysis has to be easy
12. How Cloud Helps to Address Geospatial Big Data Challenges
• Challenges:
• Big data problem (derive insights from all data)
• Compute resources when they are needed (easy scale, easy access to data)
• Solution:
• Cloud elastically provides the needed compute resources, all decoupled from storage, whether that is an object store, file system, or NoSQL database.
13. Importance for Geospatial Analytics
• Spatial streaming visualizations and analytics can present near real-time insights
• Decision makers can respond more rapidly when they see live data feeds on a map
• Spatial batch analytics can fuse multiple data sources together to understand a region
• Patterns of life emerge
• Advertisers can plan their next campaigns
• Businesses can locate their new store sites
14. Cloudbreak
• Cloudbreak can be utilized to address geospatial computational capacity needs
• Easily spin up auto-scalable clusters for different workloads and purposes, whether that is a geospatial ingest cluster with NiFi and GeoMesa or a geospatial analytics cluster with Spark and GeoMesa.
• Data can reside in your object store or even in a persistent data store.
• These ephemeral clusters can be scheduled for a period of time or only until the job is done, so you pay for only what you use.
16. How GeoMesa Helps with Geospatial Data Type Challenges
• Challenges:
• Vector & raster data
• Geospatial data types
• Solution:
• GeoMesa tools for streaming, persisting, managing, and analyzing spatio-temporal data at scale
17. What Is GeoMesa?
A suite of tools for streaming, persisting, managing, and analyzing spatio-temporal data at scale
23. How Does HDP/HDF + GeoMesa Stream Data?
• The GeoMesa Kafka DataStore allows data producers to write CRUD messages to a Kafka topic.
• Consumers of that topic build up an in-memory representation of the current state of the world.
• This allows for live maps, real-time analytics, and complex event processing.
(a minimal producer sketch follows)
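Here is a minimal sketch of a producer writing one feature through the Kafka DataStore via the standard GeoTools API, against recent GeoTools/GeoMesa APIs. The broker and ZooKeeper addresses and the AIS-like schema are illustrative, and the exact parameter keys can vary by GeoMesa version.

```scala
import org.geotools.data.{DataStoreFinder, Transaction}
import org.locationtech.geomesa.utils.geotools.SimpleFeatureTypes
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import scala.collection.JavaConverters._

object KafkaProducerSketch {
  def main(args: Array[String]): Unit = {
    // Connection parameters for a local setup (assumed keys).
    val params = Map(
      "kafka.brokers"    -> "localhost:9092",
      "kafka.zookeepers" -> "localhost:2181"
    ).asJava
    val ds = DataStoreFinder.getDataStore(params)

    // Illustrative schema; '*' marks the default geometry attribute.
    val sft = SimpleFeatureTypes.createType(
      "vessels", "mmsi:String,dtg:Date,*geom:Point:srid=4326")
    ds.createSchema(sft)

    // Each append becomes a CRUD message on the Kafka topic for this feature type.
    val writer = ds.getFeatureWriterAppend("vessels", Transaction.AUTO_COMMIT)
    val sf = writer.next()
    sf.setAttribute("mmsi", "366999712")
    sf.setAttribute("dtg", new java.util.Date())
    sf.setAttribute("geom", new GeometryFactory().createPoint(new Coordinate(-90.1, 28.9)))
    writer.write()
    writer.close()
    ds.dispose()
  }
}
```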
24. How Does HDP/HDF + GeoMesa Persist Data?
GeoMesa integrates with HBase and Accumulo:
• Key structures use space-filling curves
• Complex geospatial filters and processing can be ‘pushed down’ using Filters, Coprocessors, and Iterators
GeoMesa’s File System DataStore provides the ability to store spatio-temporally indexed data on the S3 cloud object store, in storage formats like ORC or Parquet.
(a toy space-filling-curve sketch follows)
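To illustrate what a space-filling-curve key looks like, here is a toy Z-order (Morton) encoding in Scala: longitude and latitude are normalized to integer grid cells and their bits interleaved, so points near each other on the map tend to land near each other in the sorted key space. GeoMesa's actual curves (Z2/Z3/XZ) are more sophisticated; this only shows the core idea.

```scala
object ZCurveSketch {
  // Map a coordinate into a 16-bit grid cell (toy resolution).
  def normalize(v: Double, min: Double, max: Double): Long =
    math.min(((v - min) / (max - min) * 65535).toLong, 65535L)

  // Interleave the bits of x and y into a single Z-order value.
  def interleave(x: Long, y: Long): Long =
    (0 until 16).foldLeft(0L) { (acc, i) =>
      acc | ((x >> i) & 1L) << (2 * i) | ((y >> i) & 1L) << (2 * i + 1)
    }

  def zKey(lon: Double, lat: Double): Long =
    interleave(normalize(lon, -180, 180), normalize(lat, -90, 90))

  def main(args: Array[String]): Unit = {
    // Nearby points in the Gulf of Mexico yield nearby keys...
    println(zKey(-90.1, 28.9))
    println(zKey(-90.2, 28.8))
    // ...while a point across the world lands far away in key space.
    println(zKey(139.7, 35.7))
  }
}
```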
26. Geospatial Data Flow Transformation with NiFi and GeoMesa
[Diagram: satellite AIS spatial data and edge geo data move as geo data in motion, on-premises and in the cloud, into geo data at rest in both environments; edge analytics, closed-loop analytics, machine learning, and deep historical analysis consume the flows.]
27. GeoMesa NiFi
• GeoMesa-NiFi allows you to ingest data into GeoMesa straight from NiFi by leveraging custom processors.
• NiFi allows you to ingest data into GeoMesa from every source GeoMesa supports and more.
[Diagram: data plus a SimpleFeatureType schema flow through the GeoMesa NiFi processors into the enabled datastores.]
28. GeoMesa NiFi Processors
• PutGeoMesaAccumulo: Ingest data into a GeoMesa Accumulo datastore with a GeoMesa converter or from geoavro
• PutGeoMesaHBase: Ingest data into a GeoMesa HBase datastore with a GeoMesa converter or from geoavro
• PutGeoMesaFileSystem: Ingest data into a GeoMesa File System datastore with a GeoMesa converter or from geoavro
• PutGeoMesaKafka: Ingest data into a GeoMesa Kafka datastore with a GeoMesa converter or from geoavro
• PutGeoTools: Ingest data into an arbitrary GeoTools datastore using a GeoMesa converter or avro
• ConvertToGeoAvro: Use a GeoMesa converter to create geoavro
(a sketch of the converter’s role follows)
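Each Put processor needs a SimpleFeatureType and a converter that maps raw records onto it. The Scala sketch below performs, by hand, the transformation a converter definition describes for one CSV record; in NiFi this mapping is expressed declaratively in the processor's converter configuration rather than in code, and the schema and record layout here are assumptions.

```scala
import org.geotools.feature.simple.SimpleFeatureBuilder
import org.locationtech.geomesa.utils.geotools.SimpleFeatureTypes
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}

object ConverterSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative AIS-like schema; '*' marks the default geometry.
    val sft = SimpleFeatureTypes.createType(
      "ais", "mmsi:String,dtg:Date,*geom:Point:srid=4326")

    // One raw record as it might arrive in a flow file: mmsi,epoch-millis,lon,lat
    val Array(mmsi, millis, lon, lat) = "366999712,1530000000000,-90.1,28.9".split(',')

    // Build the SimpleFeature the way a converter definition would.
    val builder = new SimpleFeatureBuilder(sft)
    builder.set("mmsi", mmsi)
    builder.set("dtg", new java.util.Date(millis.toLong))
    builder.set("geom", new GeometryFactory()
      .createPoint(new Coordinate(lon.toDouble, lat.toDouble)))
    val feature = builder.buildFeature(mmsi) // feature ID

    println(feature)
  }
}
```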
30. How does HDP + GeoMesa analyze geospatial data?
• GeoMesa integrates deeply with Spark to:
• create spatial user-defined types and user-defined functions (based on LocationTech JTS, a geometry library)
• optimize spatial queries against GeoMesa DataSources
• persist output data back to GeoMesa
• leverage Zeppelin notebooks to allow for rapid innovation and creativity
• Zeppelin allows analysts to visualize results easily
(an example follows)
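Concretely, here is a minimal sketch of what that integration looks like from Spark. The st_contains, st_makeBBOX, and st_point functions follow the geomesa-spark-jts module and may vary by version; the inline DataFrame is a stand-in for one loaded from a GeoMesa DataSource.

```scala
import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._ // provides spark.withJTS and the st_* UDFs

object SpatialSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("geo-sketch").getOrCreate()
    import spark.implicits._
    spark.withJTS // register JTS geometry UDTs and spatial UDFs on this session

    // Inline stand-in for a DataFrame loaded from a GeoMesa DataSource.
    Seq(("a", -90.1, 28.9), ("b", 139.7, 35.7)).toDF("id", "lon", "lat")
      .createOrReplaceTempView("ships")

    // Spatial predicates run as ordinary SQL once the UDFs are registered.
    spark.sql(
      """SELECT id FROM ships
        |WHERE st_contains(st_makeBBOX(-98.0, 18.0, -80.0, 31.0), st_point(lon, lat))
        |""".stripMargin).show()
  }
}
```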
36. Sub-select Data
● Create a rough sub-selection of the data
■ Bound by time
■ Bound by a bounding box roughly around the Gulf of Mexico
● Create a new temporary view from this sub-selection
● Cache the data (pull it into memory)
(a Spark sketch of these steps follows)
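A sketch of those steps, continuing the Spark session from the earlier example; the "ais" view, its column names, and the date range are assumptions about the demo dataset.

```scala
// Rough sub-selection: bound by time and by a bounding box around the Gulf of Mexico.
val gulf = spark.sql(
  """SELECT * FROM ais
    |WHERE dtg >= '2018-01-01' AND dtg < '2018-02-01'
    |  AND st_contains(st_makeBBOX(-98.0, 18.0, -80.0, 31.0), geom)
    |""".stripMargin)

gulf.createOrReplaceTempView("gulf_ais") // new temporary view from the sub-selection
spark.catalog.cacheTable("gulf_ais")     // pull the sub-selection into memory
```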
37. Data Exploration
● Query for tankers in the Gulf
● Get counts for each type of tanker
● Group the counts by day
● Graph counts to see trends
(a Spark sketch of these steps follows)
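Continuing the same session, a sketch of the count query; the vessel_type column name and "Tanker" prefix are assumptions about the AIS schema.

```scala
import org.apache.spark.sql.functions._

// Tankers in the Gulf, counted per tanker type per day.
val tankerCounts = spark.table("gulf_ais")
  .where(col("vessel_type").startsWith("Tanker"))
  .groupBy(to_date(col("dtg")).as("day"), col("vessel_type"))
  .count()
  .orderBy("day")

tankerCounts.createOrReplaceTempView("tanker_counts") // graph the daily trend in Zeppelin
```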
40. Extra Data
● Pull in gas price data
○ Acquired from EIA.gov
○ Two gas price indexes
■ NYH: New York Harbor
■ GC: Gulf Coast
● Create a temporary view so we can analyze it with SQL
(a Spark sketch of the load follows)
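A sketch of loading that series, continuing the same session; the file path and column layout are assumptions about the downloaded EIA.gov data.

```scala
// Gas price series downloaded from EIA.gov as CSV.
val gasPrices = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/eia_gas_prices.csv")
  .toDF("day", "nyh_price", "gc_price") // NYH = New York Harbor, GC = Gulf Coast

gasPrices.createOrReplaceTempView("gas_prices") // analyze alongside ship counts with SQL
```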
43. Data Exploration
● Backfill the price data with the last value to give us day-continuous data
● Min/max normalize gas prices and ship counts
● Graph gas prices and ship counts together
(a Spark sketch of the backfill and normalization follows)
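A sketch of the backfill and normalization, continuing the same session and shown for one price column; the same normalization would be applied to the ship counts.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Backfill: carry the last non-null price forward so the series is day-continuous.
val w = Window.orderBy("day").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val filled = spark.table("gas_prices")
  .withColumn("nyh_filled", last(col("nyh_price"), ignoreNulls = true).over(w))

// Min/max normalize so gas prices and ship counts share a 0-1 scale on one graph.
val Row(lo: Double, hi: Double) = filled.agg(min("nyh_filled"), max("nyh_filled")).head
val normalized = filled.withColumn("nyh_norm", (col("nyh_filled") - lo) / (hi - lo))
```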