Some of the most popular metric visualization tools work well for smaller deployments but have trouble handling large amounts of data. Timely started after an integration of OpenTSDB with Accumulo failed to meet our needs. Timely can be used to gain visibility into the performance of your networks and hardware, your Hadoop cluster and Accumulo database, and your applications. In this talk we will cover the current implementation and APIs, the security model, deployment models, and our roadmap.
– Speakers –
Dave Marion
Principal Software Engineer, Vistronix
Dave Marion is a Principal Software Engineer for Vistronix. He has been working on big data projects since 2010 and prior to that worked as a database engineer on large relational database projects. He is a veteran of the U.S. Navy and has a BS in Computer Information Systems from University of Maryland University College. He is a PMC member of Apache Accumulo and contributes to several other Apache projects.
Jim Klucar
Principal Software Engineer, Praxis Engineering
Jim Klucar is a Principal Software Engineer for Praxis Engineering. He has a BS in Electrical Engineering from Pennsylvania State University and a MS in Applied and Computational Mathematics from Johns Hopkins University. After a dozen years of developing high performance radar processing techniques, in 2010 he switched to developing Hadoop-based data warehouse and analysis systems. He has contributed to many open source projects including Apache Accumulo, Mesos and Myriad.
Drew Farris
Senior Associate, Booz Allen Hamilton
Drew Farris is a technology consultant at Booz Allen Hamilton who specializes in distributed computing, information retrieval and machine learning. He's a voting member of the Apache Software Foundation, on the Accumulo PMC and works with the Apache Incubator as a mentor for several projects.
— More Information —
For more information see http://www.accumulosummit.com/
Users trust Accumulo to properly enforce access control over their data, as specified by the visibility field. This trust can be broken by a malicious administrator or malfunctioning server, revealing sensitive information to unauthorized individuals. Our prior work encrypts data in Accumulo to protect its confidentiality from a malicious server, but does not protect against this attack. To address this threat, we have implemented a client-side tool that cryptographically enforces visibility labels in Accumulo.
Our solution is called Cryptographically Enforced Attribute-Based Access Control (CEABAC), and consists of two components: an encryption protocol and a key management system. CEABAC generates a fresh encryption key for each cell, then encrypts this key based on the cell’s visibility field. To accomplish this, the user must be able to create, store, retrieve, and revoke keys associated with each attribute that can appear in the system. The protocol guarantees that, if keys are distributed appropriately, a client without the appropriate permissions to view a cell cannot decrypt it, even if they receive its ciphertext. In the talk we will discuss the CEABAC protocol, our key management solution, how we implemented it in Accumulo, and future directions for this work.
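As a rough, non-authoritative illustration of the idea (not the actual CEABAC protocol), the sketch below generates a fresh key per cell and wraps it under an attribute key derived from the visibility field, simplified here to a single attribute; real visibility expressions with AND/OR terms require more elaborate key wrapping, and the key names are hypothetical.

```python
from cryptography.fernet import Fernet

# attribute_keys maps each visibility attribute to a long-lived wrapping key;
# in CEABAC these would come from the key management system. Hypothetical layout.
def encrypt_cell(value: bytes, visibility: str, attribute_keys: dict) -> dict:
    cell_key = Fernet.generate_key()                  # fresh key for this cell
    ciphertext = Fernet(cell_key).encrypt(value)      # encrypt the cell value
    wrapped = Fernet(attribute_keys[visibility]).encrypt(cell_key)  # wrap key per visibility
    return {"visibility": visibility, "wrapped_key": wrapped, "ciphertext": ciphertext}

def decrypt_cell(cell: dict, attribute_keys: dict) -> bytes:
    # Fails unless the client holds the wrapping key for this visibility attribute.
    cell_key = Fernet(attribute_keys[cell["visibility"]]).decrypt(cell["wrapped_key"])
    return Fernet(cell_key).decrypt(cell["ciphertext"])

keys = {"private": Fernet.generate_key()}
cell = encrypt_cell(b"42", "private", keys)
assert decrypt_cell(cell, keys) == b"42"
```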
– Speaker –
Dr. Scott Ruoti
Technical Staff, MIT Lincoln Laboratory
Scott Ruoti is a researcher at MIT Lincoln Laboratory. He graduated from Brigham Young University in 2016 with a Ph.D. in computer science, focusing on security and HCI. Recently, he has been working on cryptographic enforcement of access control in Accumulo.
— More Information —
For more information see http://www.accumulosummit.com/
The rapidly increasing amount of semantic network data today provides a wealth of insight into how entities interact and relate with one another. In order to tap into this valuable source of information, organizations require a secure and scalable repository in which to store and explore these interactions and relationships. In this talk we will discuss Apache Rya, an Accumulo-based graph store capable of storing billions of Resource Description Framework (RDF) triples and providing a rich SPARQL (SPARQL Protocol and RDF Query Language) query endpoint for exploring complex subgraph relationships. We will talk about two indexing strategies that Rya uses to address some of the challenges associated with storing and querying large graph datasets. In particular, we will discuss how our SPARQL query caching framework allows users to greatly improve query performance by storing and incrementally maintaining query results using Apache Fluo. We will also discuss our Accumulo-based entity-centric index. Inspired by Facebook’s horizontally partitioned graph index, Unicorn, Apache Rya’s entity-centric index is a novel way of storing graphs in Accumulo that draws on document-partitioned indexing techniques. This graph partitioning and indexing strategy limits network traffic and enables distributed join processing by utilizing a variation of Accumulo’s IntersectingIterator framework to perform joins server-side.
The work presented herein was funded by the Office of Naval Research under contract #N00014-12-C-0365.
– Speaker –
Dr. Caleb Meier
Software Engineer, Parsons
Caleb Meier has been a Software Engineer at Parsons Government Services for the last two years. Since joining Parsons, he has investigated and implemented a number of features to improve the query performance of Apache Rya. Caleb earned his Ph.D. in Mathematics from the University of California, San Diego and a B.A. in Mathematics from Yale University. In his spare time he enjoys climbing, biking, playing soccer and spending time with his delightful wife Leslie.
— More Information —
For more information see http://www.accumulosummit.com/
Accumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo – Accumulo Summit
Accumulo requires its users to trust each Accumulo installation with their data — a malicious server or user could easily compromise critical data or learn secrets they are not authorized to access. One particular threat is a malicious Accumulo server compromising data’s integrity, by tampering with query results and returning forged, modified, or incomplete results to a user. In prior work, we implemented a lightweight client-side tool to protect against this kind of threat. We now present improvements to this tool that handle a wider range of attacks by a malicious server and reduce overhead for the client.
In our solution, Accumulo clients use Authenticated Data Structures (ADSs) to verify their range queries’ integrity. ADS metadata is stored in Accumulo, so that after each query, the server must construct a proof that the query has not been tampered with. We use Accumulo iterators to compute these proofs on the server without imposing an unnecessary computational burden on the client. We will present our approach to adding ADSs to Accumulo, our schema for storing the ADS metadata, and opportunities for future work in efficiency and expressiveness.
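For readers unfamiliar with authenticated data structures, a Merkle-tree-style inclusion proof gives the flavor of how a client can check a server's answer against a small trusted digest; this is only a generic sketch, not the specific ADS or schema used in the talk.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(leaf: bytes, proof, trusted_root: bytes) -> bool:
    """proof is a list of (sibling_hash, sibling_is_left) pairs from leaf to root."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    # The client trusts only the root digest; a forged, modified, or omitted
    # result cannot produce a proof that hashes back to this root.
    return node == trusted_root
```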
– Speaker –
Leo St. Amour
Military Fellow, MIT Lincoln Laboratory
Leo St. Amour is a master’s student at Northeastern University and a military fellow at MIT Lincoln Laboratory. He graduated from the United States Military Academy in May 2015, where he worked on a TLS library with enhanced usability and security. In addition to his work on TLS and Accumulo, he is currently working on binary analysis, with a focus on discovering and hardening security properties.
— More Information —
For more information see http://www.accumulosummit.com/
Improving Organizational Knowledge with Natural Language Processing Enriched ... – DataWorks Summit
The information age has allowed everyone to tap into the exponential production of data. Unfortunately, much actionable insight is the result of unexpected or anomalous behavior that can only be recognized through experience. A collection of NLP microservices was crafted to complement an organization’s existing technology infrastructure and bring additional meaning to its existing, real-time collection of unstructured text.
In this session, in collaboration with Partners & Co., a Chicago-based real estate firm, we will demonstrate how to leverage an organization’s collective knowledge and turn unstructured text generated across various communication media into real-time, actionable insight. We will demonstrate how to use a combination of open source tools such as Apache NiFi, Kafka, OpenNLP, and Superset to build a full streaming NLP pipeline that consumes unstructured text, detects the language and sentences within the text, deconstructs the grammatical makeup, and derives meaning from the entities identified within the text.
Lightning Fast Analytics with Hive LLAP and Druid – DataWorks Summit
Cox Communications, one of the largest network providers in the U.S., is primarily focused on ensuring network security and providing better service to customers, including:
• Real-time monitoring of IP security traffic to identify and alert on unusual network activity across interfaces within the organization
• Enriching the security team with capabilities to determine the source and destination of traffic, class of service, and the causes of congestion on NetFlow data
Challenges:
Data related to network security includes highly granular streaming data. The major challenge lies in having a unified platform to perform data cleansing, transformation, analytics, and reporting on these huge streaming datasets. As network traffic grows, the associated data grows exponentially. A scalable framework is needed to handle these datasets and derive useful information from them. Along with data processing, data retrieval also plays a major role in better analysis. Previously, data processing was done in a daily batch using manual Python scripts and custom data structures specific to each use case. A more generic, unified framework was needed to provide an automated, real-time, end-to-end solution that delivers high-performing, more granular business results.
Solution:
Automation of this process has opportunities on several fronts, notably providing consistency, repeatability, and modernization of OLAP analytics on an enterprise big data platform. Reports can be generated more easily and quickly with the underlying OLAP engine.
• A modern big data platform provides the necessary tools and infrastructure to land, cleanse, and process real-time streaming data and enrich it using ecosystem components like Spark, Kafka, and Hive
• Impressively faster OLAP analytics using the Hive LLAP and Druid integration
• Simple and faster reporting using Superset
All of the necessary components are available under one roof in the Hortonworks Hadoop Platform.
An end-to-end solution on the big data platform produced faster and repeatable results with sub-second query performance.
Value added by the above solution:
• Deliver ultra-fast SQL analytics that can be consumed from the BI tool by the security engineering team to accelerate business results
• Opportunity for business users to explore and visualize real-time streaming datasets, integrate various data sources, and build dashboards for different slices
• Capability to run BI queries in just milliseconds over a 1 TB dataset
• Highly granular permission model on security datasets that allows intricate rules on accessibility
Integrating Apache Phoenix with Distributed Query Engines – DataWorks Summit
This talk will describe the work being done to create connectors for Presto and Apache Spark to read and write data in Phoenix tables. We will describe the new Phoenix connector that implements Spark’s DataSource v2 API, which enables customizing and optimizing reads and writes to Phoenix tables.
We will also demo the Presto-Phoenix connector, showing how it can be used to federate multiple Phoenix clusters and join Phoenix data with different types of data sources.
We will also describe some in-progress work to integrate more tightly with the query optimizers of these frameworks in order to provide table statistics and push down filters, limits, and aggregates into Phoenix whenever possible to speed up query execution.
Another area being worked on is providing a way to support bulk loading using HFiles.
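To give a feel for what reading a Phoenix table from Spark might look like, here is a hedged PySpark sketch. The format name, the option keys ("table", "zkUrl"), the table name, and the ZooKeeper quorum are assumptions for illustration; the exact keys depend on the connector version and require the connector jar on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phoenix-read-example").getOrCreate()

# Assumed option names and values; check the phoenix-spark connector docs
# for the exact keys used by the DataSource v2 implementation.
df = (spark.read
      .format("phoenix")
      .option("table", "WEB_STAT")       # hypothetical Phoenix table
      .option("zkUrl", "zk1:2181")       # hypothetical ZooKeeper quorum
      .load())

# Filters and column pruning can be pushed down into Phoenix by the connector.
df.select("HOST", "ACTIVE_VISITOR").filter(df.ACTIVE_VISITOR > 100).show()
```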
Big Data security: Facing the challenge by Carlos Gómez at Big Data Spain 2017 – Big Data Spain
This talk gives a technical and innovative overview of how companies can face the challenge of protecting the data and services that are in their data-centric platform, focusing on three main aspects: implementing network segmentation, managing AAA and securing data processing.
https://www.bigdataspain.org/2017/talk/big-data-security-facing-the-challenge
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te... – Databricks
The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.
In order to process and analyse all those data, CERN is investigating complementary ways to the traditional approaches, which mainly rely on Grid and batch jobs for data reconstruction, calibration and skimming combined with a phase of local analysis of reduced data. The new techniques should allow for interactive analysis on much bigger datasets by transparently exploiting dynamically pluggable resources.
In that sense, Spark is being used at CERN to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system: EOS. On the other hand, another important use case of Spark at CERN has recently emerged.
The LHC logging service, which collects data from the accelerator to get information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of the aforementioned use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them thanks to a new feature: the possibility for users to dynamically plug Spark clusters into their sessions in order to offload computations to those resources.
In this session we'll look at a number of different organisations that are on their big data cybersecurity journey with Apache Metron. We'll look at the different use cases they are investigating, the data sources they used, the analytics they performed, and in some cases the results they were able to find.
We'll also spend some time talking about the common themes in these projects. There are some common approaches to adopting Apache Metron as a phased project; we'll review some of the common pitfalls and give some concrete suggestions about the things you should (and shouldn't) do when you're getting started.
Finally, we'll try to tackle some of the key FAQs that come up when people first investigate the potential use of Apache Metron in the real world, based on over a year of interacting with customers and prospects as they look deeper into Apache Metron to see how it fits into their cybersecurity portfolio.
Speaker
Dave Russell, Principal Solutions Engineer, Hortonworks
Building Enterprise Grade Applications in Yarn with Apache Twill – Cask Data
Speaker: Poorna Chandra, from Cask
Big Data Applications Meetup, 07/27/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://www.youtube.com/watch?v=I1GLRXyQlx8
About the talk:
Twill is an Apache incubator project that provides a higher-level abstraction for building distributed applications on YARN. Developing distributed applications directly on YARN is challenging because it does not provide higher-level APIs, and a lot of boilerplate code must be duplicated to deploy applications. Developing YARN applications is typically done by framework developers, like those familiar with Apache Flink or Apache Spark, who need to deploy the framework in a distributed way.
By using Twill, application developers need only be familiar with the basics of the Java programming model when using the Twill APIs, so they can focus on solving business problems. In this talk I present how Twill can be leveraged, using the Cask Data Application Platform (CDAP), which heavily uses Twill for resource management, as an example.
Add Horsepower to AI/ML streaming Pipeline - Pulsar Summit NA 2021 – StreamNative
The more time data science teams spend on model training, the less business value is added, because no value is created until that model is deployed in production. Traditional HDD-based systems are not suitable for training, which is very I/O-intensive due to the complex transformations involved during data preparation. Moreover, training is not a one-time process. Trends and patterns in the data keep changing rapidly, so models need to be retrained to address drift issues and continually improve performance in production. Data scientists often experiment with thousands of models, and speeding up the process has significant business implications.
In this talk, we will cover how you can accelerate an AI/ML pipeline by speeding up data loads using the Aerospike database, which leverages its hybrid memory architecture to achieve sub-millisecond reads and writes. In a hybrid memory architecture the index is stored in memory (not persisted), and data is stored on persistent storage (SSD) and read directly from the disk. Disk I/O is not required to access the index. For time-sensitive and high-throughput use cases such as fraud detection, you need a transactional database at the edge that can handle high-velocity ingestion and support millions of IOPS. The events are then streamed downstream to your AI/ML platform for training or your inference server for predictions. We will share the reference architecture of highly performant AI/ML training and inference pipelines consisting of Apache Pulsar, Apache Spark 3.0, the Aerospike database, and its Spark and Pulsar connectors. This architecture can be extended to other use cases that demand low latency and high throughput without blowing your budget.
The Pursuit of Happiness: Building a Scalable Pipeline Using Apache Spark and... – Databricks
How do we get better than good enough? Leveraging NLP techniques, we can determine the general sentiment of a sentence, phrase, or a paragraph of text. We can mine the world of social data to get a sense of what is being said. But, how do you get control of the factors that create happiness? How do you become proactive in making end-users happy? Chatbots, human chats, and conversations are the means we are using to express our ideas to each other. NLP is great for helping us process and understand this data but can fall short. In our session, we will explore how to expand NLP/sentiment analysis to investigate the intense interactions that can occur between humans and humans or humans and robots. We will show how to pinpoint the things that work to improve quality and how to use those data points to measure the effectiveness of chatbots. Learn how we have applied popular NLP frameworks such as NLTK, Stanford CoreNLP and John Snow Labs NLP to financial customer service data. Explore techniques to analyze conversations for actionable insights. Leave with an understanding of how to influence your customers' happiness.
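As a tiny, self-contained example of the kind of sentiment scoring these frameworks provide, the snippet below uses NLTK's VADER analyzer on a single customer-service sentence; the frameworks and data used in the actual talk go well beyond this, and the sample text is made up.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
text = "The agent resolved my billing issue quickly, thank you!"
print(sia.polarity_scores(text))
# prints a dict of neg/neu/pos scores plus a compound score in [-1, 1]
```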
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
Sumo Logic QuickStart Webinar - Jan 2016 – Sumo Logic
QuickStart your Sumo Logic service with this exclusive webinar. At these monthly live events you will learn how to capitalize on critical capabilities that can amplify your log analytics and monitoring experience while providing you with meaningful business and IT insights
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid... – Timothy Spann
Scenic city summit real-time streaming in any and all clouds, hybrid and beyond
24-September-2021. Scenic City Summit. Virtual. Real-Time Streaming in Any and All Clouds, Hybrid and Beyond
Apache Pulsar, Apache NiFi, Apache Flink
StreamNative
Tim Spann
https://sceniccitysummit.com/
Big Data Conference Europe: Real-time streaming in any and all clouds, hybri... – Timothy Spann
Biography
Tim Spann is a Principal DataFlow Field Engineer at Cloudera where he works with Apache NiFi, MiniFi, Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Talk
Real-Time Streaming in Any and All Clouds, Hybrid and Beyond
Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at scale and as events arrive.
Tools:
Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, DJL.ai, Apache MXNet.
References:
https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html
Source Code: https://github.com/tspannhw/MmFLaNK
FLiP Stack
StreamNative
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System – Accumulo Summit
Timely was born to visualize and analyze metric data at a scale untenable for existing solutions. We're returning to talk about what we've achieved over the past year, provide a detailed look into the production architecture, and discuss features added during that time, including alerting and support for external analytics.
– Speakers –
Drew Farris
Chief Technologist, Booz Allen Hamilton
Drew Farris is a software developer and technology consultant at Booz Allen Hamilton where he helps his client solve problems related to large scale analytics, distributed computing and machine learning. He is a member of the Apache Software Foundation and a contributing author to Manning Publications’ “Taming Text” and the Booz Allen Hamilton “Field Guide to Data Science”.
Bill Oley
Senior Lead Engineer, Booz Allen Hamilton
Bill Oley is a senior lead software engineer at Booz Allen Hamilton where he helps his clients analyze and solve problems related to large scale data ingest, storage, retrieval, and analysis. He is particularly interested in improving visibility into large scale systems by making actionable metrics scalable and usable. He has 16 years of experience designing and developing fault-tolerant distributed systems that operate on continuous streams of data. He holds a bachelor's degree in computer science from the United States Naval Academy and a master's degree in computer science from The Johns Hopkins University.
— More Information —
For more information see http://www.accumulosummit.com/
Sumo Logic Quickstart Training 10/14/2015 – Sumo Logic
QuickStart your Sumo Logic service with this exclusive webinar. At these monthly live events you will learn how to capitalize on critical capabilities that can amplify your log analytics and monitoring experience while providing you with meaningful business and IT insights
Instrumenting and Scaling Databases with Envoy – Daniel Hochman
Every request to a database at Lyft is proxied by Envoy, providing complete visibility into the L3/L4 aspects of database interactions. This allows engineers to easily visualize changes to a database's load profile and pinpoint the root cause if necessary. Lyft has also open-sourced codecs for MongoDB, DynamoDB, and Redis. Protocol codecs in combination with custom filters yield benefits ranging from operation-level observability to horizontal scalability via sharding. Using Envoy for this purpose means that enhancements are implemented once and usable across a polyglot stack. The talk demonstrates Envoy's utility beyond traditional RPC service interactions in the network.
Real-time cloud-native open source streaming of any data to Apache Solr – Timothy Spann
Real-time cloud-native open source streaming of any data to Apache Solr
Utilizing Apache Pulsar and Apache NiFi, we can parse any document in real time at scale. We receive a lot of documents via cloud storage, email, social channels and internal document stores. We want to make all of the content and metadata available to Apache Solr for categorization, full-text search, optimization and combination with other datastores. We will not only stream documents, but all REST feeds, logs and IoT data. Once data is produced to Pulsar topics it can instantly be ingested into Solr through the Pulsar Solr Sink.
Utilizing a number of open source tools, we have created a real-time, scalable, any-document-parsing data flow. We use Apache Tika for document processing with real-time language detection, natural language processing with Apache OpenNLP, and sentiment analysis with Stanford CoreNLP, spaCy and TextBlob. We will walk everyone through creating an open source flow of documents utilizing Apache NiFi as our integration engine. We can convert PDF, Excel and Word to HTML and/or text. We can also extract the text to apply sentiment analysis and NLP categorization to generate additional metadata about our documents. We will also extract and parse images; if they contain text, we can extract it with TensorFlow and Tesseract.
Hail Hydrate! From Stream to Lake Using Open Source – Timothy Spann
(VIRTUAL) Hail Hydrate! From Stream to Lake Using Open Source - Timothy J Spann, StreamNative
https://osselc21.sched.com/event/lAPi?iframe=no
A cloud data lake that is empty is not useful to anyone. How can you quickly, scalably and reliably fill your cloud data lake with the diverse sources of data you already have and the new ones you never imagined you needed? Utilizing open source tools from Apache, the FLiP stack enables any data engineer, programmer or analyst to build reusable modules with low or no code. FLiP utilizes Apache NiFi, Apache Pulsar, Apache Flink and MiNiFi agents to load CDC, logs, REST, XML, images, PDFs, documents, text, semi-structured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach you how to fish in the deep end of the lake and return a data engineering hero. Let's hope everyone is ready to go from 0 to petabyte hero.
https://osselc21.sched.com/event/lAPi/virtual-hail-hydrate-from-stream-to-lake-using-open-source-timothy-j-spann-streamnative
DevFest UK & Ireland: Using Apache NiFi with Apache Pulsar for fast data on-r... – Timothy Spann
DevFest UK & Ireland: Using Apache NiFi with Apache Pulsar for fast data on-ramp, 2022
As the Pulsar communities grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache Streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit. Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed.
I will walk through how to get started, some use cases and demos and answer questions.
https://www.devfest-uki.com/schedule
https://linktr.ee/tspannhw
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... – Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
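A compact sketch of the levelwise idea (condense into strongly connected components, then finalize ranks one component at a time in topological order) is shown below using networkx. It is a simplified reading of the abstract, assumes the graph has no dead ends as the report's precondition states, and is not the authors' implementation.

```python
import networkx as nx

def levelwise_pagerank(G: nx.DiGraph, d: float = 0.85, iters: int = 50) -> dict:
    # Condense the graph into its DAG of strongly connected components.
    C = nx.condensation(G)               # C.nodes[i]["members"] = vertices in SCC i
    n = G.number_of_nodes()
    rank = {v: 1.0 / n for v in G}
    # Process components in topological order: every in-neighbor outside the
    # current component already has its final rank, so each block converges
    # locally without global per-iteration communication.
    for comp in nx.topological_sort(C):
        members = C.nodes[comp]["members"]
        for _ in range(iters):
            new = {}
            for v in members:
                incoming = sum(rank[u] / G.out_degree(u) for u in G.predecessors(v))
                new[v] = (1 - d) / n + d * incoming   # assumes no dead ends
            rank.update(new)
    return rank
```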
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Accumulo Summit 2016: Timely - Scalable Secure Time Series Database
1. Scalable Secure Time Series Database
https://NationalSecurityAgency.github.io/timely
2. Overview
• Built on Apache Accumulo
– Proven Security, Scale & Reliability
• Uses Netty for communication protocols
– Widely adopted, easy to integrate
• Provides secure access to labeled data
– Easily customized to meet unique architectures
3. History
• Integrated OpenTSDB with Apache Accumulo
– Using Eric Newton's shim code
– Seemed to have issues with scale
– FAIL: could not get past StackOverflowError (OpenTSDB issue #334)
• Decided to write it from scratch
– Keep Grafana
– Use Grafana OpenTSDB datasource plugin
• Had something working in 2 weeks
4. Simple Architecture
• Insert data points
• Subscribe to data points
• Query for aggregated data points
[Diagram: Timely server with time series ingest, subscribe, and query paths]
5. Application Interfaces
• Supports multiple protocols
– udp, tcp, https, websocket
• Operations for storing data
– All protocols, security tag optional
• Operations for working with time series data
– https and websocket
• Operations for subscribing to data
– websocket only
6. Timely Input Format (Text)
• Simple text based on OpenTSDB put format:
put <metric> <timestamp> <value> <tag>[,<tag>...]
• Example
put sys.cpu.idle 1469735914000 25.0 host=s01n04 rack=s01 instance=0
• Supported in all protocols
• viz tag used to label data
– viz=private
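As a rough illustration of the text protocol above, the snippet below sends a single put line to a Timely server over TCP. The host, port, and tag values are placeholders, not values from the talk; adjust them to match your own deployment.

```python
import socket
import time

# Hypothetical Timely endpoint; substitute your server's host and TCP port.
TIMELY_HOST, TIMELY_PORT = "timely.example.com", 54321

# One metric in the text "put" format shown above, with a viz label.
line = "put sys.cpu.idle {ts} 25.0 host=s01n04 rack=s01 viz=private\n".format(
    ts=int(time.time() * 1000))  # timestamps are in milliseconds

with socket.create_connection((TIMELY_HOST, TIMELY_PORT)) as sock:
    sock.sendall(line.encode("utf-8"))
```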
7. Timely Input Format (Binary)
• Binary format uses Google FlatBuffers encoding
• IDL file located in the source code
• Generate client code in multiple languages
• Currently supported in UDP and TCP protocols
8. Sending Data to Timely
• Send data directly from your application
• Can use existing collection agents:
– OpenTSDB Tcollector
– CollectD
• Can leverage StatsD servers also
– HADOOP-12360 (StatsD Metrics2 sink)
9. Storage Format
• Meta Table
– Stores unique metric and tag information
• Metrics Table
– Stores individual metric data
– Each data point stored N ways, N = # tags
• Several bytes to store each key
– Run Length Encoding
– Compression
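To make the "stored N ways" idea concrete, here is a small sketch that fans one data point out into one key per tag so a scan on any tag can locate the point. The row and column layout here is purely illustrative and is not Timely's actual metrics-table schema.

```python
# Illustrative only: fan a single data point out into N keys, one per tag.
def metric_keys(metric, timestamp_ms, value, tags, viz):
    all_tags = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    keys = []
    for tag_key, tag_value in sorted(tags.items()):
        row = f"{metric}\x00{timestamp_ms}"
        col_family = f"{tag_key}={tag_value}"   # the tag this copy indexes
        col_qualifier = all_tags                 # full tag set for the point
        keys.append((row, col_family, col_qualifier, viz, value))
    return keys

print(metric_keys("sys.cpu.idle", 1469735914000, 25.0,
                  {"host": "s01n04", "rack": "s01"}, "private"))
```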
10. Visualizing Time Series Data
• Timely built to work with Grafana
• Timely App for Grafana
– Drop it into the Grafana plugins directory
– Provides Timely data source
– Integrates security features into Grafana
– Example dashboards provided
11. Timely App – Data Sources
• Define Timely Data Sources
• Test Connectivity
12. Timely App – Menu Items
• Login to defined data source
• View Metric Names / Tags
13. Timely App – Login
• Top – Login using client certificates
• Bottom – Login using username / password
14. Sample Dashboards
• Timely App included dashboards:
– Timely Status
– System Overview
– Hadoop Overview
– Accumulo Overview
22. Subscribing to Data
• Subscription API over WebSocket protocol
– WebSocket is a bi-directional protocol
– Timely uses secure WebSockets (wss)
• Create connection and subscribe to:
– Data for specific metric names
– Data for a specific time window
– Optionally, data that matches tag names and values
• Can register multiple subscriptions
• Remove subscriptions when appropriate
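A minimal sketch of what a subscription client could look like with the Python websockets library is shown below. The endpoint URL and the message fields (operation, subscriptionId, and so on) are assumptions made for illustration; the actual subscription message schema is defined by the Timely API.

```python
import asyncio
import json
import ssl

import websockets  # pip install websockets

async def subscribe():
    # Hypothetical secure WebSocket endpoint for a Timely server.
    ssl_ctx = ssl.create_default_context()
    async with websockets.connect("wss://timely.example.com:54322/websocket",
                                  ssl=ssl_ctx) as ws:
        # Register a subscription, then ask for one metric filtered by tag.
        # Field names here are illustrative, not the authoritative schema.
        await ws.send(json.dumps({"operation": "create",
                                  "subscriptionId": "sub-1"}))
        await ws.send(json.dumps({"operation": "add",
                                  "subscriptionId": "sub-1",
                                  "metric": "sys.cpu.idle",
                                  "tags": {"rack": "s01"}}))
        async for message in ws:   # stream matching data points as they arrive
            print(message)

asyncio.run(subscribe())
```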
23. Security - Implementation
• Timely stores the labels provided in the viz tag
– Timely only calls flatten() on the CV for consistent ordering
• Spring Security enables users to plug in their authentication mechanism and role provider
• Workflow:
– User logs into Timely via /login HTTPS endpoint
– User authenticated via Spring Security
– HTTP secure session cookie returned for future API calls
25. Transport Security
• HTTP Strict Transport Security (HSTS)
– Accessing via http will redirect to HTTPS
– Rule stored in browser for configured time
• HTTPS
• WSS
26. Modes of Operation
• Anonymous access enabled
– Unauthenticated users only see unlabeled data
– Authenticated users see what they are allowed
• Anonymous access disabled
– Unauthenticated users receive an error message
– Authenticated users see what they are allowed
27. Roadmap
• Summarization of historical data
• New Time Series API
– Move away from OpenTSDB API
– Add additional features
• Timely Client
– Make subscribing to data easier
– Enable analytics to be easily written
• Enrichment
– Allow for user-supplied information about time series
• Support Grafana annotations
28. Deploying Timely
• Java 8 required for Accumulo and Timely
• Tested with Accumulo 1.7.x and Hadoop 2.6
• Standalone Mode
– Uses Mini Accumulo Cluster
– Useful for development and testing
– Data lost across restarts
• Non-Standalone Mode
– 1+ Timely Servers
29. Deployment #1
• Setup:
– 1 Timely Server
– Accumulo 1.7.1, 26 TServers on single-disk hosts
• Timely server receiving 2.75M metrics/min
• Inserting 20.3M keys/min (338K / sec)
– @10:1 ratio inserted to received
• 2.2T keys in the metrics table
– 8.75TB unreplicated
– @ 4.3 bytes per key, ~ 40 bytes per metric
30. Deployment #2
• Setup:
– 2 Timely servers
– Accumulo 1.7.1, 31 TabletServers on single-disk hosts
• Timely servers receiving 10M metrics/minute
• Inserting 71M keys/minute (1.18M / sec)
– @ 7:1 ratio inserted to received
• 1.91T keys in the metrics table
– 7.47TB unreplicated
– @4.3 bytes per key, ~ 30 bytes per metric