Presentation at Data Days Texas 2015, in Austin. A deep dive into Spark, Tachyon and Mesos code as well as Atigeo's open source contributions, Jaws, a Spark SQL rest server and a Spark Job Server.
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
The document discusses lessons learned from embedding Cassandra in the xPatterns big data analytics platform. It provides an agenda that includes discussing Cassandra usage in xPatterns, the necessary developments like data modeling optimizations, robust REST APIs, geo-replication, and a demo of exporting to NoSQL APIs. Key lessons learned since Cassandra versions 0.6 to 2.0.6 are also summarized, such as the need for consistent clocks, reducing column families, and monitoring.
Building a big data intelligent application on top of xPatterns using tools that leverage Spark, Shark, Mesos, Tachyon and Cassandra. Jaws, open sourcing our own spark sql restful service and our own contributions to the Spark and Mesos projects, lessons learned
Building an intelligent big data application in 30 minutesClaudiu Barbura
Strata Barcelona presentation slides, a live demo of building an intelligent big data application from a web console. The tools and APIs behind are built on top of Spark, Spark SQL/Shark, Tachyon, Mesos, Cassandra, SolrCloud, iPython and include: ELT pipeline (ingestion and transformation), data warehouse explorer, export to NoSql and generated APIs, export to SolrCloud and generated APIs, predictive model building, training and publishing, dashboard UI, monitoring and instrumentation.
Scale confidently. From laptop to lots of nodes to multi-cluster, multi-use case deployments, Elastic experts are sharing best practices to master and pitfalls to avoid when it comes to scaling Elasticsearch.
Using Apache Spark in the Cloud—A Devops Perspective with Telmo OliveiraSpark Summit
Toon is a leading brand in the European smart energy market, currently expanding internationally, providing energy usage insights, eco-friendly energy management and smart thermostat use for the connected home. As value added services become ever more relevant in this market, we have the need to ensure that we can easily and safely on-board new tenants into our data platform. In this talk we’re going to guide you across a less discussed side of using Spark in production – devops. We will speak about our journey from an on-premise cluster to a managed solution in the cloud. A lot of moving parts were involved: ETL flows, data sharing with 3rd parties and data migration to the new environment. Add to this the need to have a multi-tenant environment, revamp our toolset and deploy a live public facing service. It’s possible to find a lot of great examples of how Spark is used for data-science purposes. On the data engineering side, we need to deploy production services, ensure data is cleaned, secured and available, and keep the data-science teams happy. We’d like to share some of the options we took and some of the lessons learned from this (ongoing) transition.
Progress® DataDirect ® Spark SQL ODBC and JDBC drivers deliver the fastest, high-performance connectivity so your existing BI and analytics applications can access Big Data in Apache Spark.
Strata Singapore 2017 business use case section
"Big Telco Real-Time Network Analytics"
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/62797
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
The document discusses lessons learned from embedding Cassandra in the xPatterns big data analytics platform. It provides an agenda that includes discussing Cassandra usage in xPatterns, the necessary developments like data modeling optimizations, robust REST APIs, geo-replication, and a demo of exporting to NoSQL APIs. Key lessons learned since Cassandra versions 0.6 to 2.0.6 are also summarized, such as the need for consistent clocks, reducing column families, and monitoring.
Building a big data intelligent application on top of xPatterns using tools that leverage Spark, Shark, Mesos, Tachyon and Cassandra. Jaws, open sourcing our own spark sql restful service and our own contributions to the Spark and Mesos projects, lessons learned
Building an intelligent big data application in 30 minutesClaudiu Barbura
Strata Barcelona presentation slides, a live demo of building an intelligent big data application from a web console. The tools and APIs behind are built on top of Spark, Spark SQL/Shark, Tachyon, Mesos, Cassandra, SolrCloud, iPython and include: ELT pipeline (ingestion and transformation), data warehouse explorer, export to NoSql and generated APIs, export to SolrCloud and generated APIs, predictive model building, training and publishing, dashboard UI, monitoring and instrumentation.
Scale confidently. From laptop to lots of nodes to multi-cluster, multi-use case deployments, Elastic experts are sharing best practices to master and pitfalls to avoid when it comes to scaling Elasticsearch.
Using Apache Spark in the Cloud—A Devops Perspective with Telmo OliveiraSpark Summit
Toon is a leading brand in the European smart energy market, currently expanding internationally, providing energy usage insights, eco-friendly energy management and smart thermostat use for the connected home. As value added services become ever more relevant in this market, we have the need to ensure that we can easily and safely on-board new tenants into our data platform. In this talk we’re going to guide you across a less discussed side of using Spark in production – devops. We will speak about our journey from an on-premise cluster to a managed solution in the cloud. A lot of moving parts were involved: ETL flows, data sharing with 3rd parties and data migration to the new environment. Add to this the need to have a multi-tenant environment, revamp our toolset and deploy a live public facing service. It’s possible to find a lot of great examples of how Spark is used for data-science purposes. On the data engineering side, we need to deploy production services, ensure data is cleaned, secured and available, and keep the data-science teams happy. We’d like to share some of the options we took and some of the lessons learned from this (ongoing) transition.
Progress® DataDirect ® Spark SQL ODBC and JDBC drivers deliver the fastest, high-performance connectivity so your existing BI and analytics applications can access Big Data in Apache Spark.
Strata Singapore 2017 business use case section
"Big Telco Real-Time Network Analytics"
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/62797
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points:
- SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes.
- Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL.
- The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and integration with BI tools help meet requirements for timely processing and quick
Tim Spann will present on learning Apache Spark. He is a senior solutions architect who previously worked as a senior field engineer and startup engineer. airis.DATA, where Spann works, specializes in machine learning and graph solutions using Spark, H20, Mahout, and Flink on petabyte datasets. The agenda includes an overview of Spark, an explanation of MapReduce, and hands-on exercises to install Spark, run a MapReduce job locally, and build a project with IntelliJ and SBT.
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
Legacy enterprise data warehouse (EDW) architecture, geared toward day-to-day workloads associated with operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for range of use cases, including IOT predictive maintenance.
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Spark Summit
Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another?
Container application platforms like Kubernetes provide the answers to some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to dig deep on the performance and security implications of running in containers. This talk will provide a deep dive into tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux Containers, the JDK, and Kubernetes. You’ll leave inspired and enabled to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
The document discusses powering predictive mapping at scale using the SMACK stack, which includes Spark, Kafka, and Elasticsearch. It describes how the SMACK stack can ingest millions of events per second from connected devices, store the data in Apache Spark, and allow real-time and batch processing of the data. It also provides an example of using the stack for real-time tracking of geo-enabled IoT devices and demonstrates the data flow and a demo of the system.
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
The document discusses the Spark Operator, which allows deploying, managing, and monitoring Spark clusters on Kubernetes. It describes how the operator extends Kubernetes by defining custom resources and reacting to events from those resources, such as SparkCluster, SparkApplication, and SparkHistoryServer. The operator takes care of common tasks to simplify running Spark on Kubernetes and hides the complexity through an abstract operator library.
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit
The document discusses IBM's transition from a single-tenant Hadoop architecture to a multi-tenant Apache Spark architecture for their Watson Analytics for Social Media product. The new architecture aggregates social media data from thousands of tenants into a single stream and uses Spark, Kafka and Zookeeper to provide robust real-time analytics with low latency switching between tenants. Key aspects of the new architecture include separating analytics into tenant-specific and language-specific components, and removing state from processing components.
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Spark Summit
Spark data processing is shifting from on-premises to cloud service to take advantage of its horizontal resource scalability, better data accessibility and easy manageability. However, fully utilizing the computational power, fast storage and networking offered by cloud service can be challenging without deep understanding of workload characterizations and proper software optimization expertise. In this presentation, we will use a Spark based programing framework – Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process of configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house developed data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in application. We will show a case study in which we identify a thread scalability issue of Java Instanceof operator. The fix in Scala language hugely improves performance of GATK4 and other Spark based workloads.
- Elastic provides a search and analytics platform called the Elastic Stack that includes the Elastic Stack, Beats data shippers, and Kibana analytics and visualization tools.
- The presentation discussed updates to Elastic's products including performance improvements to search, new features for distributed search across data centers, and enhanced security options for authentication and authorization.
- Elastic aims to provide customizable and extensible solutions for users to ingest, store, search, analyze and visualize large volumes of data from various sources.
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...Databricks
Predictive intelligence from machine learning has the potential to change everything in our day to day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications such as those that enhance the user experience, can benefit from real-time robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
Automated Metadata Management in Data Lake – A CI/CD Driven ApproachDatabricks
We as data engineers are aware of trade off’s between development speed, metadata governance and schema evolution (or restriction) in rapidly evolving organization. Our day to day activities involve adding/removing/updating tables, protecting PII Information, curating and exposing data to our consumers. While our data lake keeps growing exponentially, there is equal increase in our downstream consumers. Struggle is to maintain balance between quickly promoting metadata changes with robust validation for downstream systems stability. In relational world DDL, DML changes can be managed through numerous options available for every kind of database from the vendor or 3rd party. As engineers we developed a tool which uses centralized git managed repository of data schemas in yml structure with ci/cd capabilities which maintains stability of our data lake and downstream systems.
In this presentation Northwestern Mutual Engineers, will discuss how they designed and developed new end-to-end ci/cd driven metadata management tool to make introduction of new tables/views, managing access requests etc in a more robust, maintainable and scalable way, all with only checking in yml files. This tool can be used by people who have no or minimal knowledge of spark.
Key focus will be:
Need for metadata management tool in a data lake
Architecture and Design of the tool
Maintaining information on databases/tables/views like schema, owner, PII, description etc in simple to understand yml structure
Live demo of creating a new table with CI/CD promotion to production
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for HadoopHBaseCon
Kylin is an open source distributed analytics engine contributed by eBay that provides a SQL interface and OLAP on Hadoop supporting extremely large datasets. Kylin's pre-built MOLAP cubes (stored in HBase), distributed architecture, and high concurrency helps users analyze multidimensional queries via SQL and other BI tools. During this session, you'll learn how Kylin uses HBase's key-value store to serve SQL queries with relational schema.
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Spark Summit
Cybercrime is big business. Gartner reports worldwide security spending at $80B, with annual losses totalling more than $1.2T (in 2015). Small to medium sized businesses now account for more than half of the attacks targeting enterprises today. The threat actors behind these attacks are continually shifting their techniques and toolkits to evade the security defenses that businesses commonly use. Thanks to the growing frequency and complexity of attacks, the task of identifying and mitigating security-related events has become increasingly difficult.
At eSentire, we use a combination of data and human analytics to identify, respond to and mitigate cyber threats in real-time. We capture all network traffic on our customers’ networks, hence ingesting a large amount of time-series data. We process the data as it is being streamed into our system to extract relevant threat insights and block attacks in real-time. Furthermore, we enable our cybersecurity analysts to perform in-depth investigations to: i) confirm attacks and ii) identify threats that analytical models miss. Having security experts in the loop provides feedback to our analytics engine, thereby improving the overall threat detection effectiveness.
So how exactly can you build an analytics pipeline to handle a large amount of time-series/event-driven data? How do you build the tools that allow people to query this data with the expectation of mission-critical response times?
In this presentation, William Callaghan will focus on the challenges faced and lessons learned in building a human-in-the loop cyber threat analytics pipeline. They will discuss the topic of analytics in cybersecurity and highlight the use of technologies such as Spark Streaming/SQL, Cassandra, Kafka and Alluxio in creating an analytics architecture with missions-critical response times.
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Qubole
The effective use of big data is the key to gaining a competitive advantage and outperforming the competition. This change demands that companies consume and blend enormous amount of data created from divergent and inherently mismatched sources, which represents a paradigm shift to the traditional data warehouse.
Companies need to modernize their data warehouse, augmenting it with a platform that allows storage, processing, exploration and analysis of large and diverse datasets without limiting the ability to deliver the data access, and flexibility responding to the needs of the business. That’s where Oracle Cloud and Qubole work together delivering a new breed of data platform —capable of storing and processing the overwhelming amount of data that on-premises big data deployments cannot handle.
Watch this on-demand webinar to understand:
- Why deploying big data on-premises is expensive, complex to maintain and limits your ability to scale across new use cases and data sources
- How Oracle Bare Metal Cloud's predictable and fast performance compute and network services deliver the foundation of a cost-effective, high-performance big data platform
- How Qubole leverages Oracle Bare Metal Cloud to provide a turnkey big data service that optimizes cost, performance, and scale, enabling self-service data exploration.
Qubole delivers a cloud-based, turnkey, self-service big data service that removes the complexity and reduces the cost of doing big data. It leverages Oracle Bare Metal Cloud’s next generation of scalable, inexpensive and performant compute, network and storage public cloud infrastructure to provide a solution that accelerates time to market and reduces the risk of your big data initiatives.
This document discusses Azure HDInsight, a managed Apache Hadoop and Spark platform. It provides a secure environment for building data lakes in the cloud. Key capabilities include ingesting and analyzing data from various sources using technologies like Apache Spark, Hive, Kafka and HBase. It also discusses data storage options, performance, security features and tools for management and monitoring of HDInsight clusters.
Rajat Venkatesh from Qubole presented on Quark, a virtualization engine for analytics. Quark uses a multi-store architecture to optimize queries using materialized views, predicate injection, and denormalized/sorted tables. It supports multiple SQL and storage engines. The roadmap includes improvements to the cost-based optimizer, support for OLAP cubes, and developing Quark as a service. Coordinates for the Quark GitHub and mailing list were provided.
Today, many companies are faced with a huge quantity of data and a wide variety of tools with which to process it. This potentially allows for great opportunities to satisfy customers’ needs and bring user experience to the next level. However, in order to achieve this and provide a competitive solution, sophisticated and complex data processing is needed. Such processing can rarely be done with one tool or framework — a number of tools are often involved, each having prowess in a particular field of the processing pipeline.
In this session, we will see the latest endeavors of Apache Ignite to integrate with other big data platforms and provide its in-memory computing strengths for data processing pipelines. In particular we will have a closer look at how it can be integrated and used with Apache Kafka and/or Flume, and outline several use scenarios.
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
Learn about the Big Data Processing ecosystem at Netflix and how Apache Spark sits in this platform. I talk about typical data flows and data pipeline architectures that are used in Netflix and address how Spark is helping us gain efficiency in our processes. As a bonus – i’ll touch on some unconventional use-cases contrary to typical warehousing / analytics solutions that are being served by Apache Spark.
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Databricks
Aggregation based features account for a quarter of the several 1000s features used by the ML-based decisioning system by the Risk team at Uber. We observed several repetitive, cumbersome steps needed for onboarding a feature, every single time. Therefore, to accelerate developer velocity, and to enable Feature Engineering at scale, we decided to develop a generic spark based infrastructure to simplify the process to no more than a simple spec file, containing a parameterized query, along with some metadata on where the feature should be aggregated and stored.
In the presentation, we will describe the architecture of the final solution, highlighting some of the advanced capabilities like backfill support and self-healing for correctness. We will showcase how, using data stored in Hive and using Spark, we developed a highly scalable solution to carry out feature aggregation in an incremental way. By dividing data aggregation responsibility across the realtime access layer, and the batch computation components, we ensured that only entities for which there is actual value changes are dispersed to our real-time access store (Cassandra). We will share how we did data modeling in Cassandra using its native capabilities such as counters, and how we worked around some of the limitations of Cassandra. We will also cover the details about the access service how we do different types of feature stitching together. How, based on our data model we were able to ensure that all the feature for an entity with the same aggregation window, were queried via a single query. Finally, we will cover some of the details on how these incremental aggregated features have enabled shorter turnaround times for the models using such features.
Efficiently Triaging CI Pipelines with Apache Spark: Mixing 52 Billion Events...Databricks
Continuous integration (CI) pipelines generate massive amounts of messy log data. At Pure Storage engineering, we run over 65,000 tests per day creating a large triage problem. Spark’s flexible computing platform allows us to write a single application for both streaming and batch jobs to understand the state of our CI pipeline. Spark indexes log data for real-time reporting (Streaming), uses Machine Learning for performance modeling and prediction (Batch job), and re-indexes old data for newly encoded patters (Batch job). Previous work on a mixed streaming and batch environment describes the options for persisting data and their trade-offs:
1) short interval buckets which hurts batch performance
2) long interval buckets which increases micro batch time windows
3) additional software on the background to compact the short interval buckets which adds complexity.
This talk will go over how we use the filesystem metadata of our disaggregated compute and storage layers to write over half a million files per day of varied sizes from 52 Billion events and have efficient batch jobs without compaction that allow us to process over 40TB per hour. We will go over the challenges and best practices to achieve efficiency in this mixed environment scenarios.
This document outlines the agenda for a Tachyon Meetup in San Francisco. The agenda includes discussing the xPatterns architecture, BDAS++, demos of Tachyon internals and APIs, and lessons learned. BDAS++ refers to enhancements made to Tachyon to support Spark SQL and the Spark job server. Lessons learned focus on issues discovered like partial in-memory file storage bugs and best practices for Tachyon usage.
xPatterns is a big data analytics platform as a service that enables a rapid development of enterprise-grade analytical applications. It provides tools, api sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSql and solrCloud cluster for real-time access through low-latency/high-throughput apis as well as dashboard and visualization api/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications build on top of xPatterns for our largest customer for that runs xPatterns in production on top a data warehouse consisting of several hundreds TB of medical, pharmacy and lab data records consisting of tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS Spark, Shark, Mesos and Tachyon, with lessons learned and demos.
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points:
- SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes.
- Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL.
- The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and integration with BI tools help meet requirements for timely processing and quick
Tim Spann will present on learning Apache Spark. He is a senior solutions architect who previously worked as a senior field engineer and startup engineer. airis.DATA, where Spann works, specializes in machine learning and graph solutions using Spark, H20, Mahout, and Flink on petabyte datasets. The agenda includes an overview of Spark, an explanation of MapReduce, and hands-on exercises to install Spark, run a MapReduce job locally, and build a project with IntelliJ and SBT.
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
Legacy enterprise data warehouse (EDW) architecture, geared toward day-to-day workloads associated with operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for range of use cases, including IOT predictive maintenance.
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Spark Summit
Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another?
Container application platforms like Kubernetes provide the answers to some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to dig deep on the performance and security implications of running in containers. This talk will provide a deep dive into tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux Containers, the JDK, and Kubernetes. You’ll leave inspired and enabled to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
The document discusses powering predictive mapping at scale using the SMACK stack, which includes Spark, Kafka, and Elasticsearch. It describes how the SMACK stack can ingest millions of events per second from connected devices, store the data in Apache Spark, and allow real-time and batch processing of the data. It also provides an example of using the stack for real-time tracking of geo-enabled IoT devices and demonstrates the data flow and a demo of the system.
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
The document discusses the Spark Operator, which allows deploying, managing, and monitoring Spark clusters on Kubernetes. It describes how the operator extends Kubernetes by defining custom resources and reacting to events from those resources, such as SparkCluster, SparkApplication, and SparkHistoryServer. The operator takes care of common tasks to simplify running Spark on Kubernetes and hides the complexity through an abstract operator library.
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit
The document discusses IBM's transition from a single-tenant Hadoop architecture to a multi-tenant Apache Spark architecture for their Watson Analytics for Social Media product. The new architecture aggregates social media data from thousands of tenants into a single stream and uses Spark, Kafka and Zookeeper to provide robust real-time analytics with low latency switching between tenants. Key aspects of the new architecture include separating analytics into tenant-specific and language-specific components, and removing state from processing components.
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Spark Summit
Spark data processing is shifting from on-premises to cloud service to take advantage of its horizontal resource scalability, better data accessibility and easy manageability. However, fully utilizing the computational power, fast storage and networking offered by cloud service can be challenging without deep understanding of workload characterizations and proper software optimization expertise. In this presentation, we will use a Spark based programing framework – Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process of configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house developed data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in application. We will show a case study in which we identify a thread scalability issue of Java Instanceof operator. The fix in Scala language hugely improves performance of GATK4 and other Spark based workloads.
- Elastic provides a search and analytics platform called the Elastic Stack that includes the Elastic Stack, Beats data shippers, and Kibana analytics and visualization tools.
- The presentation discussed updates to Elastic's products including performance improvements to search, new features for distributed search across data centers, and enhanced security options for authentication and authorization.
- Elastic aims to provide customizable and extensible solutions for users to ingest, store, search, analyze and visualize large volumes of data from various sources.
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...Databricks
Predictive intelligence from machine learning has the potential to change everything in our day to day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications such as those that enhance the user experience, can benefit from real-time robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
Automated Metadata Management in Data Lake – A CI/CD Driven ApproachDatabricks
We as data engineers are aware of trade off’s between development speed, metadata governance and schema evolution (or restriction) in rapidly evolving organization. Our day to day activities involve adding/removing/updating tables, protecting PII Information, curating and exposing data to our consumers. While our data lake keeps growing exponentially, there is equal increase in our downstream consumers. Struggle is to maintain balance between quickly promoting metadata changes with robust validation for downstream systems stability. In relational world DDL, DML changes can be managed through numerous options available for every kind of database from the vendor or 3rd party. As engineers we developed a tool which uses centralized git managed repository of data schemas in yml structure with ci/cd capabilities which maintains stability of our data lake and downstream systems.
In this presentation Northwestern Mutual Engineers, will discuss how they designed and developed new end-to-end ci/cd driven metadata management tool to make introduction of new tables/views, managing access requests etc in a more robust, maintainable and scalable way, all with only checking in yml files. This tool can be used by people who have no or minimal knowledge of spark.
Key focus will be:
Need for metadata management tool in a data lake
Architecture and Design of the tool
Maintaining information on databases/tables/views like schema, owner, PII, description etc in simple to understand yml structure
Live demo of creating a new table with CI/CD promotion to production
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for HadoopHBaseCon
Kylin is an open source distributed analytics engine contributed by eBay that provides a SQL interface and OLAP on Hadoop supporting extremely large datasets. Kylin's pre-built MOLAP cubes (stored in HBase), distributed architecture, and high concurrency helps users analyze multidimensional queries via SQL and other BI tools. During this session, you'll learn how Kylin uses HBase's key-value store to serve SQL queries with relational schema.
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Spark Summit
Cybercrime is big business. Gartner reports worldwide security spending at $80B, with annual losses totalling more than $1.2T (in 2015). Small to medium sized businesses now account for more than half of the attacks targeting enterprises today. The threat actors behind these attacks are continually shifting their techniques and toolkits to evade the security defenses that businesses commonly use. Thanks to the growing frequency and complexity of attacks, the task of identifying and mitigating security-related events has become increasingly difficult.
At eSentire, we use a combination of data and human analytics to identify, respond to and mitigate cyber threats in real-time. We capture all network traffic on our customers’ networks, hence ingesting a large amount of time-series data. We process the data as it is being streamed into our system to extract relevant threat insights and block attacks in real-time. Furthermore, we enable our cybersecurity analysts to perform in-depth investigations to: i) confirm attacks and ii) identify threats that analytical models miss. Having security experts in the loop provides feedback to our analytics engine, thereby improving the overall threat detection effectiveness.
So how exactly can you build an analytics pipeline to handle a large amount of time-series/event-driven data? How do you build the tools that allow people to query this data with the expectation of mission-critical response times?
In this presentation, William Callaghan will focus on the challenges faced and lessons learned in building a human-in-the loop cyber threat analytics pipeline. They will discuss the topic of analytics in cybersecurity and highlight the use of technologies such as Spark Streaming/SQL, Cassandra, Kafka and Alluxio in creating an analytics architecture with missions-critical response times.
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Qubole
The effective use of big data is the key to gaining a competitive advantage and outperforming the competition. This change demands that companies consume and blend enormous amount of data created from divergent and inherently mismatched sources, which represents a paradigm shift to the traditional data warehouse.
Companies need to modernize their data warehouse, augmenting it with a platform that allows storage, processing, exploration and analysis of large and diverse datasets without limiting the ability to deliver the data access, and flexibility responding to the needs of the business. That’s where Oracle Cloud and Qubole work together delivering a new breed of data platform —capable of storing and processing the overwhelming amount of data that on-premises big data deployments cannot handle.
Watch this on-demand webinar to understand:
- Why deploying big data on-premises is expensive, complex to maintain and limits your ability to scale across new use cases and data sources
- How Oracle Bare Metal Cloud's predictable and fast performance compute and network services deliver the foundation of a cost-effective, high-performance big data platform
- How Qubole leverages Oracle Bare Metal Cloud to provide a turnkey big data service that optimizes cost, performance, and scale, enabling self-service data exploration.
Qubole delivers a cloud-based, turnkey, self-service big data service that removes the complexity and reduces the cost of doing big data. It leverages Oracle Bare Metal Cloud’s next generation of scalable, inexpensive and performant compute, network and storage public cloud infrastructure to provide a solution that accelerates time to market and reduces the risk of your big data initiatives.
This document discusses Azure HDInsight, a managed Apache Hadoop and Spark platform. It provides a secure environment for building data lakes in the cloud. Key capabilities include ingesting and analyzing data from various sources using technologies like Apache Spark, Hive, Kafka and HBase. It also discusses data storage options, performance, security features and tools for management and monitoring of HDInsight clusters.
Rajat Venkatesh from Qubole presented on Quark, a virtualization engine for analytics. Quark uses a multi-store architecture to optimize queries using materialized views, predicate injection, and denormalized/sorted tables. It supports multiple SQL and storage engines. The roadmap includes improvements to the cost-based optimizer, support for OLAP cubes, and developing Quark as a service. Coordinates for the Quark GitHub and mailing list were provided.
Today, many companies are faced with a huge quantity of data and a wide variety of tools with which to process it. This potentially allows for great opportunities to satisfy customers’ needs and bring user experience to the next level. However, in order to achieve this and provide a competitive solution, sophisticated and complex data processing is needed. Such processing can rarely be done with one tool or framework — a number of tools are often involved, each having prowess in a particular field of the processing pipeline.
In this session, we will see the latest endeavors of Apache Ignite to integrate with other big data platforms and provide its in-memory computing strengths for data processing pipelines. In particular we will have a closer look at how it can be integrated and used with Apache Kafka and/or Flume, and outline several use scenarios.
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
Learn about the Big Data Processing ecosystem at Netflix and how Apache Spark sits in this platform. I talk about typical data flows and data pipeline architectures that are used in Netflix and address how Spark is helping us gain efficiency in our processes. As a bonus – i’ll touch on some unconventional use-cases contrary to typical warehousing / analytics solutions that are being served by Apache Spark.
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Databricks
Aggregation based features account for a quarter of the several 1000s features used by the ML-based decisioning system by the Risk team at Uber. We observed several repetitive, cumbersome steps needed for onboarding a feature, every single time. Therefore, to accelerate developer velocity, and to enable Feature Engineering at scale, we decided to develop a generic spark based infrastructure to simplify the process to no more than a simple spec file, containing a parameterized query, along with some metadata on where the feature should be aggregated and stored.
In the presentation, we will describe the architecture of the final solution, highlighting some of the advanced capabilities like backfill support and self-healing for correctness. We will showcase how, using data stored in Hive and using Spark, we developed a highly scalable solution to carry out feature aggregation in an incremental way. By dividing data aggregation responsibility across the realtime access layer, and the batch computation components, we ensured that only entities for which there is actual value changes are dispersed to our real-time access store (Cassandra). We will share how we did data modeling in Cassandra using its native capabilities such as counters, and how we worked around some of the limitations of Cassandra. We will also cover the details about the access service how we do different types of feature stitching together. How, based on our data model we were able to ensure that all the feature for an entity with the same aggregation window, were queried via a single query. Finally, we will cover some of the details on how these incremental aggregated features have enabled shorter turnaround times for the models using such features.
Efficiently Triaging CI Pipelines with Apache Spark: Mixing 52 Billion Events...Databricks
Continuous integration (CI) pipelines generate massive amounts of messy log data. At Pure Storage engineering, we run over 65,000 tests per day creating a large triage problem. Spark’s flexible computing platform allows us to write a single application for both streaming and batch jobs to understand the state of our CI pipeline. Spark indexes log data for real-time reporting (Streaming), uses Machine Learning for performance modeling and prediction (Batch job), and re-indexes old data for newly encoded patters (Batch job). Previous work on a mixed streaming and batch environment describes the options for persisting data and their trade-offs:
1) short interval buckets which hurts batch performance
2) long interval buckets which increases micro batch time windows
3) additional software on the background to compact the short interval buckets which adds complexity.
This talk will go over how we use the filesystem metadata of our disaggregated compute and storage layers to write over half a million files per day of varied sizes from 52 Billion events and have efficient batch jobs without compaction that allow us to process over 40TB per hour. We will go over the challenges and best practices to achieve efficiency in this mixed environment scenarios.
This document outlines the agenda for a Tachyon Meetup in San Francisco. The agenda includes discussing the xPatterns architecture, BDAS++, demos of Tachyon internals and APIs, and lessons learned. BDAS++ refers to enhancements made to Tachyon to support Spark SQL and the Spark job server. Lessons learned focus on issues discovered like partial in-memory file storage bugs and best practices for Tachyon usage.
xPatterns is a big data analytics platform as a service that enables a rapid development of enterprise-grade analytical applications. It provides tools, api sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSql and solrCloud cluster for real-time access through low-latency/high-throughput apis as well as dashboard and visualization api/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications build on top of xPatterns for our largest customer for that runs xPatterns in production on top a data warehouse consisting of several hundreds TB of medical, pharmacy and lab data records consisting of tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS Spark, Shark, Mesos and Tachyon, with lessons learned and demos.
This document describes an autonomous analytics platform that allows users to analyze streaming data. The platform uses a unified big data technology stack including Spark, Cassandra, Hadoop, Kafka and Elasticsearch. It has a cloud-agnostic architecture and supports multiple machine learning frameworks. The platform includes a Domain Specific Language (DSL) that allows power users to create full data pipelines and analytics workflows with a few lines of code. It also includes a DSL Workbench for interactively building, editing and publishing analytical pipelines. Additionally, the document introduces "Auto Curious", which harnesses user interactions to autonomously discover insights and compose DSL commands through a question graph interface.
This document outlines the agenda and content for a presentation on xPatterns, a tool that provides APIs and tools for ingesting, transforming, querying and exporting large datasets on Apache Spark, Shark, Tachyon and Mesos. The presentation demonstrates how xPatterns has evolved its infrastructure to leverage these big data technologies for improved performance, including distributed data ingestion, transformation APIs, an interactive Shark query server, and exporting data to NoSQL databases. It also provides examples of how xPatterns has been used to build applications on large healthcare datasets.
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...Steve Kramer
Paragon Science used a combination of network analysis, community detection, topic detection, sentiment analysis, and anomaly detection methods to find key influencers and viral topics in two recent Twitter data sets: one of 7.9 M tweets regarding ISIS, a second set consisting of more than 117 M tweets about the 2016 primary elections, and a third set of 7M tweets realted to Brexit.
Paragon Science's patented dynamic anomaly detection technology is based on methods drawn from dynamical systems and chaos theory. In particular, we can calculate finite-time Lyapunov exponents from any time-dependent data stream to find the clusters of entities that are behaving most chaotically compared to the rest of the data set. Because we do not have to specify normal vs. abnormal behavior in advance, no machine learning per se is required. In a robust fashion that is tolerant of missing or erroneous data, we can identify the "unknown unknowns" that can represent threats to be mitigate or opportunities to be seized. To date, our technique has been applied successfully to a broad range of industry verticals, including healthcare data (Advisory Board Company), web user behavior data (Vast), mobile phone data (Place IQ), vehicle pricing analytics (Digital Motorworks/CDK Global), online coupon data (RetailMeNot), email monitoring for patent law cases, and social media monitoring.
GLAICE es un agua mineral natural de Galicia que utiliza un sistema bio-reactivo exclusivo para ionizar el agua y hacerla alcalina. Proporciona agua pura y con beneficios para la salud, directamente desde el corazón de Galicia.
Este documento presenta una variedad de temas relacionados con el deporte en Colombia, incluyendo un álbum de fotos, multimedia, una línea de tiempo, cómics y publicaciones sobre la historia y cultura del deporte en el país. También incluye un blog sobre noticias y eventos deportivos actuales.
This document discusses using Azure Batch for high performance computing and provides an overview of its key concepts and components. Azure Batch allows scaling compute-intensive workloads across a managed cluster of virtual machines. It is well-suited for applications that can be parallelized by breaking work into independent tasks. The document outlines Azure Batch constructs like pools, jobs, and tasks. It also provides examples of how tasks are distributed across nodes and queued based on priority and resource availability. A use case of parallel data file loading using Azure Batch is presented.
Recently, in the fields Business Intelligence and Data Management, everybody is talking about data science, machine learning, predictive analytics and many other “clever” terms with promises to turn your data into gold. In this slides, we present the big picture of data science and machine learning. First, we define the context for data mining from BI perspective, and try to clarify various buzzwords in this field. Then we give an overview of the machine learning paradigms. After that, we are going to discuss - at a high level - the various data mining tasks, techniques and applications. Next, we will have a quick tour through the Knowledge Discovery Process. Screenshots from demos will be shown, and finally we conclude with some takeaway points.
Cloud Foundry Introduction for CF Meetup Tokyo March 2016Tomohiro Ichimura
Tomohiro Ichimura is a senior solution architect at Pivotal Japan. He introduced Cloud Foundry, an open source platform as a service. Over 50 corporations contribute to Cloud Foundry, which has over 21,000 members. Cloud Foundry provides rapid application development and deployment across public and private clouds. It offers developer services, continuous integration/delivery, and multi-cloud portability through components like BOSH, Elastic Runtime, and Operations Manager.
SpringOne Platform 2016
Speakers: Neville George; Principal Engineer, Comcast & Sergey Matochkin; Principal Architect, Comcast
Over the course of the last year, Comcast has matured its Cloud Foundry platform from proof-of-concept to production ready. The platform currently supports some of our most critical applications while also being an incubator for more innovation. Transitioning to a new platform is never easy and we have had to win over skeptics with operational excellence. Join us to hear about our experience with:
-Reducing Time to Market for new applications and services with PaaS
-Enabling DevOps with Cloud Foundry PaaS
-Extending Pivotal Cloud Foundry with new capabilities to meet DevOps needs
Enterprise Cloud Data Platforms - with Microsoft AzureKhalid Salama
These slides gives an overview on MS Azure Data Architecture and Services, including Data Lake Analytics, Data Factory, Azure SQL DW, Stream Analytics, Azure Machine learning tools, and Data Catalog. This is also known as Cortana Analytical Suite
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexibility and higher performance approaches to analyze their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld
VMworld 2013
Michael Corey, Ntirety, Inc
Jeff Szastak, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Apache Deep Learning 201 - Philly Open SourceTimothy Spann
#phillyopensource
Introduction talk for data engineers for deep learning on apache with apache mxnet, apache nifi, apache hive, apache hadoop, apache spark, python and other tools.
Streaming Solutions for Real time problemsAbhishek Gupta
The document is a presentation on streaming solutions for real-time problems using Apache Kafka, Kafka Streams, and Redis. It begins with an introduction and overview of the technologies. It then presents a sample monitoring application using metrics from multiple machines as a use case. The presentation demonstrates how to implement this application using Kafka as the event store, Kafka Streams for processing, and Redis as the state store. It also shows how to deploy the application components on Oracle Cloud.
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015Iulia Emanuela Iancuta
The document describes an in-memory data pipeline and warehouse using Spark, Spark SQL, Tachyon and Parquet. It involves ingesting financial transaction data from S3, transforming the data through cleaning and joining steps, and building a data warehouse using Spark SQL and Parquet for querying. Key aspects covered include distributing metadata lookups, balancing data partitions, broadcasting joins to avoid skew, caching data in Tachyon and Jaws for a RESTful interface to Spark SQL.
Caching is a frequently used and misused technique for speeding up performance, off-loading non-scalable or expensive infrastructure, scaling systems and coping with large processing peaks. In this talk Greg introduces you to the theory of caching and highlights key things to keep in mind when you apply caching. Then we take a comprehensive look at how the JCache standard standardises Java usage of caching.
This document provides an introduction and overview of Apache NiFi 1.11.4. It discusses new features such as improved support for partitions in Azure Event Hubs, encrypted repositories, class loader isolation, and support for IBM MQ and the Hortonworks Schema Registry. It also summarizes new reporting tasks, controller services, and processors. Additional features include JDK 11 support, encrypted repositories, and parameter improvements to support CI/CD. The document provides examples of using NiFi with Docker, Kubernetes, and in the cloud. It concludes with useful links for additional NiFi resources.
Caching is a frequently used and misused technique for speeding up performance, off-loading non-scalable or expensive infrastructure, scaling systems and coping with large processing peaks. In this talk Greg introduces you to caching and highlights the key caching theory points that you should consider in applying caching. Then we take a comprehensive look at the new JCache standard standardises Java usage of caching.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
In Big Data field, Spark SQL is important data processing module for Apache Spark to work with structured row-based data in a majority of operators. Field-programmable gate array(FPGA) with highly customized intellectual property(IP) can not only bring better performance but also lower power consumption to accelerate CPU-intensive segments for an application.
Slides form Config Management Camp, looking at how you can take a collaborative GitFlow approach to Terraform using Remote State, Modules and Dynamically Generated Credentials using Vault
Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010Arun Gupta
Arun Gupta presented on running Java EE 6 applications in the cloud. He discussed Java EE 6 support on various cloud platforms including Amazon, RightScale, Elastra, and Joyent. He also compared features of different cloud vendors and how Java EE can evolve to better support cloud computing. Gupta concluded that Java EE 6 applications can easily be deployed to various clouds and GlassFish provides a feature-rich implementation of Java EE 6.
Healthcare Claim Reimbursement using Apache SparkDatabricks
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
Five essential new enhancements in azure HDnsightAshish Thapliyal
This document discusses features of Apache Spark on Azure HDInsight including a new Spark IO cache that provides significant performance improvements of up to 9x for Spark queries. It also discusses other HDInsight features like Hive LLAP for interactive querying, data analytics templates, and tools for Spark job debugging and diagnosis. Azure HDInsight is presented as a secure, managed Hadoop and Spark cloud platform for building data lakes on Azure.
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformDataStax Academy
This session will discuss how Cassandra/Solr can be used to create real-time analytics platform – jKool.
jKool provides an in-memory analysis of time-series data, automatically performing sequencing, correlation, grouping, enriching, synchronizing, computing, querying and displaying data streams. The session will discuss architecture, challenges and approaches taken to create a real-time analytics platform on top of open source big data analytics platforms: Cassandra, Solr, Kafka & Spark.
Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31Timothy Spann
Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31
An overview for Big Data Engineers on how one could use Apache projects to run deep learning workflows with Apache NiFi, YARN, Spark, Kafka and many other Apache projects.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfflufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
4. 4
• Jaws, http spark sql rest service
• http://github.com/Atigeo/http-spark-sql-server
Backward compatible with Shark and Spark 0.x stack
• Spark Job Server
multiple Spark contexts in same JVM, job submission in Java + Scala
https://github.com/Atigeo/spark-job-rest
• Mesos framework starvation bug
https://github.com/Atigeo/mesos_starvation
• Tachyon patch (https://github.com/amplab/tachyon/pull/482)
BDAS++
5. 5
• persist(OFF_HEAP) temporary storage
• RDD.persist() OFF_HEAP > MEMORY_SER_AND_DISK
• count() when ser/de is expensive, gc cost
• Shuffle Spark vs Hadoop, file consolidation, spillage, compression
• Storage vs memory fraction
• Kryo
• Do not set spark.executor.uri
Spark Internals
6. 6
• partial in-memory file storage bug
• HadoopRDD vs TachyonRDD
• journal file on hdfs -> backup of local master disk
• hdfs api
• RawTable in Shark
• native API: getInStream(CACHE|NO_CACHE) -> local workers
• do not evict blocks when streaming to Tachyon/hdfs
• Tachyon > Spark JVM Cache for long running jobs
• kryo/defaultCodec/sequenceFile format to minimize memory footprint
• Parquet is better! Schema and similar performance
• ramDisk vs SSD
Tachyon Internals
7. 7
• … vs YARN
• Fine-grained vs coarse-grained Spark contexts
• *SchedulerBackend and *ExecutorBackend
• Hadoop-mesos plugin
• Master failover (zk)
• Framework starvation patch
Mesos Internals