This document provides an overview of Apache Spark, including:
- Apache Spark is a next-generation data processing engine for Hadoop that enables fast in-memory processing of huge, distributed, heterogeneous datasets.
- Spark offers tools for data science and components for building data products, and can be used for tasks such as machine learning, graph processing, and streaming data analysis.
- Spark improves on MapReduce by keeping intermediate data in memory and supporting interactive queries, and it runs on both standalone clusters and Hadoop clusters.
Transitioning Compute Models: Hadoop MapReduce to Spark (Slim Baltagi)
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig... (Alex Zeltov)
Introduction to Big Data Analytics using Apache Spark on HDInsights on Azure (SaaS) and/or HDP on Azure (PaaS)
This workshop will provide an introduction to Big Data Analytics using Apache Spark, using HDInsights on Azure (SaaS) and/or an HDP deployment on Azure (PaaS). There will be a short lecture that includes an introduction to Spark and its components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks, such as batch analytics, stream processing, and statistical modeling, that would previously have required separate processing engines. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will also be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create Hive tables, explore the data with Spark and SQL, transform the data, and then issue SQL queries. We will be using Scala and/or PySpark for the labs.
Pandas UDF: Scalable Analysis with Python and PySpark (Li Jin)
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodels, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard for processing big data. Spark ships with a Python interface, aka PySpark. However, because Spark's runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability.
In this talk, we introduce a new type of PySpark UDF designed to solve this problem: the Vectorized UDF. Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds: the ability to define easy-to-use, high-performance UDFs and to scale up your analysis with Spark.
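The performance gap the abstract describes comes from calling a Python function once per row versus once per batch of rows. A minimal sketch of the idea using plain pandas (`scale_price` is a hypothetical UDF body, not from the talk; in PySpark, the same Series-in/Series-out function would be registered with `pandas_udf`):

```python
import pandas as pd

# A pandas UDF body: receives a whole pandas Series (one Arrow batch)
# and returns a Series of the same length -- one Python call per batch.
def scale_price(prices: pd.Series) -> pd.Series:
    return prices * 1.1

prices = pd.Series([100.0, 250.0, 80.0])

# Row-at-a-time, which is effectively what a classic PySpark UDF does:
row_at_a_time = prices.apply(lambda p: p * 1.1)

# Vectorized, which is what a pandas UDF does: one call per batch.
vectorized = scale_price(prices)

assert row_at_a_time.equals(vectorized)
```

In PySpark, this function would be wrapped with `pyspark.sql.functions.pandas_udf` and applied to a DataFrame column; Arrow then moves each batch between the JVM and the Python worker without per-row serialization, which is where the speedup comes from.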
Opal: Simple Web Services Wrappers for Scientific Applications (Sriram Krishnan)
The grid-based infrastructure enables large-scale scientific applications to be run on distributed resources and coupled in innovative ways. However, in practice, grid resources are not easy to use for end-users, who have to learn how to generate security credentials, stage inputs and outputs, access grid-based schedulers, and install complex client software. There is an imminent need to provide transparent access to these resources so that end-users are shielded from the complicated details and free to concentrate on their domain science. Scientific applications wrapped as Web services alleviate some of these problems by hiding the complexities of the back-end security and computational infrastructure, exposing only a simple SOAP API that can be accessed programmatically by application-specific user interfaces. However, writing the application services that access grid resources can be quite complicated, especially if the work has to be replicated for every application. In this presentation, we present Opal, a toolkit for wrapping scientific applications as Web services in a matter of hours, providing features such as scheduling, standards-based grid security, and data management in an easy-to-use and configurable manner.
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w... (Databricks)
Real-time/online machine learning is an integral piece in the machine learning landscape, particularly in regard to unsupervised learning. Areas such as focused advertising, stock price prediction, recommendation engines, network evolution and IoT streams in smart cities and smart homes are increasing in demand and scale. Continuously-updating models with efficient update methodologies, accurate labeling, feature extraction, and modularity for mixed models are integral to maintaining scalability, precision, and accuracy in high demand scenarios.
This session explores a real-time/online learning algorithm and implementation using Spark Streaming in a hybrid batch/ semi-supervised setting. It presents an easy-to-use, highly scalable architecture with advanced customization and performance optimization. Within this framework, we will examine some of the key methodologies for implementing the algorithm, including partitioning and aggregation schemes, feature extraction, model evaluation and correction over time, and our approaches to minimizing loss and improving convergence. The result is a simple, accurate pipeline that can be easily adapted and scaled to a variety of use cases.
The performance of the algorithm will be evaluated against existing implementations for both linear and logistic prediction. The session will also cover real-time use cases of the streaming pipeline using real time-series data and present optimization and implementation strategies to improve both accuracy and efficiency in a semi-supervised setting.
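The continuously-updating model the abstract describes boils down to an incremental weight update per arriving example. A pure-Python sketch of online linear regression via SGD (the names, data, and learning rate are illustrative, not the session's actual pipeline; in Spark Streaming, the same update would run inside each micro-batch):

```python
# Online linear regression: the model is updated one example at a time,
# as it would be for each record arriving in a streaming micro-batch.
def sgd_update(w, b, x, y, lr=0.05):
    pred = w * x + b
    err = pred - y
    # Gradient step on squared error: d/dw = 2*err*x, d/db = 2*err.
    return w - lr * 2 * err * x, b - lr * 2 * err

w, b = 0.0, 0.0
# Simulated stream of (x, y) pairs drawn from y = 3x + 1.
stream = [(x, 3 * x + 1) for x in [0.1, 0.5, 0.9, 0.3, 0.7] * 200]
for x, y in stream:
    w, b = sgd_update(w, b, x, y)

# After enough examples, the online model approaches the true parameters.
assert abs(w - 3.0) < 0.1 and abs(b - 1.0) < 0.1
```

The streaming versions of this pattern in Spark (e.g. `StreamingLinearRegressionWithSGD`) apply the same gradient step aggregated over each micro-batch rather than per individual record.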
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration... (Databricks)
Devon Energy is a Fortune 500 company focused on unconventional upstream oil and gas production. With a companywide focus on innovation and data-driven decision making, IT has been challenged to make more data available to more people more quickly. To this end, we have leveraged the scale of Microsoft Azure and Databricks’ Unified Analytics Platform to help reimagine our integration, data warehousing and analytics landscape to improve agility while moving our workloads to the cloud. We are in the third year of this transformation and have lessons learned around improving the testability of data pipelines, code management, model training and deployment, promotion, and user empowerment. In this talk, we will share our experience managing the lifecycle of data engineering and machine learning solutions and striking the balance between agility and reliability in a single platform, while democratizing data access to users from all disciplines across the company.
Author: Paul Bruffett
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ... (Spark Summit)
In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have successfully used multiple micro-batch Spark Streaming pipelines to update and process information like product availability and store pick-up, along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real-time. Earlier, all product catalog changes in the index had a 24-hour delay; using Spark Streaming, we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving end-customers instant access to features like product availability and store pick-up.
Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that our data pipelines use to detect abnormalities in search data. Anomaly detection is an important problem not only in the search domain but also in many others, such as performance monitoring and fraud detection. In doing this, we realized that Spark DataFrames are not only able to process information faster but are also more flexible to work with. One can write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc. all in the same place very easily, and can build DataFrame templates that multiple teams can use and reuse effectively. We believe that, implemented correctly, Spark DataFrames can potentially replace Hive and Pig in the big data space and become a unified data language.
We conclude that Spark Streaming and Data Frames are the key to processing extremely large streams of data in real-time with ease of use.
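Anomaly detection of the kind described above is commonly done with simple per-group statistics that translate directly into DataFrame aggregations. A hedged pure-Python sketch of z-score-based detection (the threshold and data are illustrative; the abstract does not describe Walmart's actual logic, and in Spark the mean and standard deviation would come from `groupBy().agg()` and be joined back onto the data):

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag indices of points more than `threshold` standard deviations
    from the mean of the series."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Daily search-query counts with one obvious spike at index 6.
counts = [100, 98, 103, 101, 99, 102, 500, 97, 100, 101]
anomalies = zscore_anomalies(counts, threshold=2.0)
assert anomalies == [6]
```

Because mean, standard deviation, and the comparison are all built-in aggregations and column expressions, the same logic scales out in Spark without any custom UDF.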
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5 (Cloudera, Inc.)
Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there?
Cloudera Navigator Optimizer is the solution: it analyzes existing SQL workloads to provide instant insights and turns them into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera's enterprise Hadoop platform, now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser... (Databricks)
We present the Azure Cognitive Services on Spark, a simple and easy-to-use extension of the SparkML library to all Azure Cognitive Services. This integration allows Spark users to embed cloud intelligence directly into their Spark computations, enabling a new generation of intelligent applications on Spark. Furthermore, we show that with our new Containerized Cognitive Services, one can embed cloud intelligence directly into the Spark cluster for ultra-low-latency, on-prem, and offline applications. We show how, using our integration, one can compose these cognitive services with other services, SQL computations, and deep networks to create sophisticated and intelligent heterogeneous applications. Moreover, we show how to redeploy these compositions as RESTful services with Spark Serving. We will also explore the architecture of these contributions, which leverage HTTP on Spark, a novel integration between Spark and the widely used Hypertext Transfer Protocol (HTTP). This library can integrate into the Spark ecosystem any framework capable of communicating over HTTP. Finally, we demonstrate how to use these services to create a large class of intelligent applications such as custom search engines, real-time facial recognition systems, and unsupervised object detectors.
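The HTTP-on-Spark pattern described above amounts to mapping partitions of rows through an HTTP client so that connection setup and request overhead are amortized over batches. A minimal pure-Python sketch with a stubbed service standing in for a real endpoint (all names here are illustrative; the actual library packages this as SparkML transformers):

```python
# Stub standing in for a remote scoring endpoint: scores a batch of texts.
def fake_sentiment_service(batch):
    return [1.0 if "good" in text else 0.0 for text in batch]

def score_partition(rows, call_service, batch_size=2):
    """Send a partition's rows to the service in batches, the way a
    mapPartitions function would, so per-request overhead is amortized."""
    batch, out = [], []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            out.extend(call_service(batch))
            batch = []
    if batch:  # flush the final partial batch
        out.extend(call_service(batch))
    return out

partition = ["good movie", "bad service", "good price"]
scores = score_partition(partition, fake_sentiment_service)
assert scores == [1.0, 0.0, 1.0]
```

In a real deployment, `call_service` would hold a reusable HTTP session per partition and point at either the cloud endpoint or a container co-located on the cluster node.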
In this presentation, Zoosk will share its experience transitioning the Zoosk Big Data Platform from Hive to a Hive/Impala configuration. We will share lessons learned, some guidelines about when to use one or the other, and a high-level before-and-after view of its architecture.
Building an ETL pipeline for Elasticsearch using Spark (Itai Yaffe)
How we, at eXelate, built an ETL pipeline for Elasticsearch using Spark, including:
* Processing the data using Spark.
* Indexing the processed data directly into Elasticsearch using the elasticsearch-hadoop plug-in for Spark.
* Managing the flow using some of the services provided by AWS (EMR, Data Pipeline, etc.).
The presentation includes some tips and discusses some of the pitfalls we encountered while setting up this process.
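The three steps above can be sketched end to end with a stubbed index client (all names and fields here are hypothetical; the real pipeline uses Spark for the transform and the elasticsearch-hadoop connector for bulk indexing):

```python
# Stub index client: collects documents the way Elasticsearch would
# receive them per bulk request.
class FakeIndex:
    def __init__(self):
        self.docs = []

    def bulk(self, docs):
        self.docs.extend(docs)

def transform(record):
    """Step 1: the Spark processing stage, here a simple enrichment."""
    return {"user": record["user"], "clicks": record["clicks"],
            "active": record["clicks"] > 0}

def run_pipeline(records, index, batch_size=2):
    """Steps 2-3: index transformed records in bulk batches."""
    batch = []
    for rec in map(transform, records):
        batch.append(rec)
        if len(batch) == batch_size:
            index.bulk(batch)
            batch = []
    if batch:  # flush the final partial batch
        index.bulk(batch)

raw = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 0},
       {"user": "c", "clicks": 1}]
index = FakeIndex()
run_pipeline(raw, index)
assert [d["active"] for d in index.docs] == [True, False, True]
```

Batching the bulk requests, as above, is the main lever for indexing throughput regardless of whether the writer is a plain client or the Spark connector.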
Progress® DataDirect® Spark SQL ODBC and JDBC drivers deliver fast, high-performance connectivity so your existing BI and analytics applications can access Big Data in Apache Spark.
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Cloudera, Inc.
Hadoop was the first software to permit affordable use of petabytes. In the decade since Hadoop was introduced, many other projects have been created around the Hadoop Distributed File System (HDFS) storage layer and its MapReduce processing engine, forming a rich software ecosystem. In this keynote, Doug Cutting will explain how Apache Spark provides a second-generation processing engine that greatly improves on MapReduce, and why this transition provides an example of an evolutionary pattern in the data ecosystem that gives it long-term strength.
Downscaling: The Achilles Heel of Autoscaling Apache Spark Clusters (Databricks)
Adding nodes at runtime (upscaling) to an already running Spark-on-YARN cluster is fairly easy. But taking those nodes away (downscaling) when the workload drops later is a difficult problem. To remove a node from a running cluster, we need to make sure it is used for neither compute nor storage.
On production workloads, we see that many nodes can't be taken away because:
* Nodes are running some containers even though they are not fully utilized, i.e., containers are fragmented across nodes. For example, each node runs only one or two containers/executors although it has the resources to run four.
* Nodes hold shuffle data on local disk that a Spark application running on the cluster will consume later. In this case, the Resource Manager will never reclaim these nodes, because losing shuffle data could lead to costly recomputation of stages.
In this talk, we will discuss how to improve downscaling in Spark-on-YARN clusters under such constraints. We will cover changes to the container allocation strategy in YARN and to the Spark task scheduler which together achieve better packing of containers. This ensures containers are consolidated onto a smaller set of nodes, so some nodes carry no compute at all. In addition, we will cover enhancements to the Spark driver and the External Shuffle Service (ESS) that proactively delete shuffle data we already know has been consumed. This ensures nodes are not holding unnecessary shuffle data, freeing them from storage and making them available for reclamation and faster downscaling.
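The consolidation described above is essentially a packing problem: place all executors on as few nodes as possible so the rest can be drained. A simplified pure-Python sketch that consolidates fungible executor slots (illustrative only; YARN's actual allocation logic accounts for memory, cores, and locality):

```python
def pack_executors(executors_per_node, capacity=4):
    """Repack executor counts onto the fewest nodes possible.
    Returns per-node executor counts after packing; nodes that end up
    with zero executors are candidates for downscaling."""
    total = sum(executors_per_node)
    packed, remaining = [], total
    while remaining > 0:
        take = min(capacity, remaining)  # fill each node to capacity
        packed.append(take)
        remaining -= take
    # Pad with empty (reclaimable) nodes up to the original cluster size.
    packed += [0] * (len(executors_per_node) - len(packed))
    return packed

# Four nodes fragmented at 1-2 executors each, capacity 4 per node:
before = [2, 1, 2, 1]
after = pack_executors(before)
assert after == [4, 2, 0, 0]      # two nodes can now be reclaimed
assert sum(after) == sum(before)  # no executors lost in the repack
```

In practice, the repack cannot teleport running executors; the scheduler instead biases new container placement toward already-busy nodes so fragmentation never builds up in the first place.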
This talk will address new architectures emerging for large-scale streaming analytics: some based on Spark, Mesos, Akka, Cassandra, and Kafka (SMACK), and others on newer streaming analytics platforms and frameworks such as Apache Flink or GearPump. Popular architectures like Lambda separate the layers of computation and delivery and require many technologies with overlapping functionality. This can result in duplicated code, untyped processes, and high operational overhead, not to mention the cost (e.g., of ETL).
I will discuss the problem domain and what is needed in terms of strategy, architecture, application design, and code to begin leveraging simpler data flows. We will cover how this particular set of technologies addresses common requirements and how the pieces work together to enrich and reinforce each other.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, forcing businesses to either adapt or be left behind.
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15 (MLconf)
Sparking Data in the Cloud: Data isn't useful until it's used to drive decision-making. Companies like Pinterest are using machine learning to build data-driven recommendation engines and perform advanced cluster analysis. This talk will cover best practices for running Spark in the cloud and common challenges in iterative design and interactive analysis.
3 Things to Learn About:
* How Sparklyr supports a complete backend for dplyr, a popular tool for working with data frame objects both in memory and out of memory
* How Sparklyr allows data scientists to use dplyr to translate R code into Spark SQL
* How Sparklyr supports MLlib so data scientists can run classifiers, regressions, and many other machine learning algorithms in Spark
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing, by "Sampat Kumar" from "Harman". The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
Doug Cutting discusses:
- A brief history of Spark and its rise in popularity across developers and enterprises
- Spark's advantages over MapReduce
- The One Platform Initiative and the roadmap for Spark
- The future of data processing in Hadoop
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It extends Hadoop's MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
Spark was developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license and donated to the Apache Software Foundation in 2013; Apache Spark has been a top-level Apache project since February 2014.
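The MapReduce model that Spark extends can be illustrated with a plain-Python word count (a local sketch of the programming model only, not Spark's actual API; all names here are made up):

```python
from collections import defaultdict

def map_phase(lines):
    # "Map": emit (word, 1) pairs, analogous to a Hadoop mapper
    # or Spark's flatMap + map over an RDD of lines.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # "Reduce": sum counts per key, analogous to reduceByKey in Spark.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark extends MapReduce", "spark supports streaming"]
counts = reduce_phase(map_phase(lines))
print(counts["spark"])  # 2
```

In Spark, the same shape of computation runs distributed across a cluster, with intermediate results kept in memory rather than written to disk between phases.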
This document shares some basic knowledge about Apache Spark.
Similar to Apache Spark in Scientific Applications (20)
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
Advanced time series analysis (TSA) requires very special data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and for protecting safety.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns. We investigate how to apply analysis algorithms in the cloud. Finally we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent cloud.
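A minimal sketch of the correlation-network reconstruction pattern mentioned above, assuming Pearson correlation and a fixed threshold (both choices are illustrative, not the exact method from the talk):

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_network(series, threshold=0.9):
    # Reconstruct a graph: nodes are series, edges connect pairs
    # whose absolute correlation exceeds the threshold.
    names = list(series)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(series[a], series[b])) >= threshold:
                edges.append((a, b))
    return edges

series = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with s1
    "s3": [4.0, 1.0, 3.0, 2.0],   # only weakly related
}
print(correlation_network(series))  # [('s1', 's2')]
```

On real data the pairwise step is the expensive part, which is why this kind of analysis is typically distributed across a cluster.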
In this presentation I review the architecture of an AI application for IoT environments.
Since specific modeling and training aspects also have an impact on the final implementation of an enterprise-ready solution, such solutions quickly become very complex.
The complexity of AI systems for IoT is a big challenge; thus, I break this complexity down into particular views, which emphasize the individual but still interconnected aspects more clearly.
Improving computer vision models at scale (Strata Data NYC)Dr. Mirko Kämpf
Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. When testing data is present at the petabyte scale, the ability to seamlessly access all the images that have been assigned specific labels poses a technical challenge by itself.
We share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable. Images and labels are stored in HBase. The model is encapsulated in a (Py)Spark program, while the images are indexed with Solr and can be accessed from a Hue dashboard. Triplification of facts detected inside images contributes to a large knowledge graph, queryable via SPARQL.
Improving computer vision models at scale presentationDr. Mirko Kämpf
Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. When testing data is present at the petabyte scale, the ability to seamlessly access all the images that have been assigned specific labels poses a technical challenge by itself.
Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable. Images and labels are stored in HBase. The model is encapsulated in a PySpark program, while the images are indexed with Solr and can be accessed from a Hue dashboard.
PCAP Graphs for Cybersecurity and System TuningDr. Mirko Kämpf
Cybersecurity is a broad topic and many commercial products address it. We demonstrate a fundamental concept in network analysis: reconstruction and visualization of temporal networks. Furthermore, we apply the method to describe the operational conditions of a Hadoop cluster. Our experiments provide first results and allow a classification of the cluster state related to current workloads. The temporal networks show significant differences for different operation modes. In reality we would expect mixed workloads. If such workload parameters are known, we are able to handle atypical events accordingly, which means we can create alerts based on context information rather than only on packet content. We show an end-to-end example: (1) data collection is done in Python, using a sniffer script; (2) using Apache Hive and Apache Spark we analyze the network traffic data and create the temporal network; (3) finally, we visualize the results using Gephi. As a next step, we plan to contribute to the Apache Spot project.
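The reconstruction of a temporal network from captured traffic can be sketched as follows, assuming packet records reduced to (timestamp, source, destination) tuples (a toy illustration, not the talk's Hive/Spark pipeline; addresses and timestamps are made up):

```python
from collections import defaultdict

def temporal_network(packets, window=60):
    # Aggregate (src, dst) edges into time windows: each window yields
    # one edge set, i.e. one "snapshot" of the temporal network.
    snapshots = defaultdict(set)
    for ts, src, dst in packets:
        snapshots[ts // window].add((src, dst))
    return dict(snapshots)

# Hypothetical packet records as (timestamp in seconds, source, destination)
packets = [
    (5,  "10.0.0.1", "10.0.0.2"),
    (30, "10.0.0.1", "10.0.0.3"),
    (70, "10.0.0.2", "10.0.0.3"),
]
net = temporal_network(packets)
print(sorted(net))  # [0, 1] -> two one-minute snapshots
```

Comparing successive snapshots is what reveals the workload-dependent differences described above.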
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
Etosha is an enterprise-focused collaborative graph database with facts about data sets, analysis procedures, and research methods. People from multiple organizations can be connected while every owner retains full control over their own data.
From Events to Networks: Time Series Analysis on ScaleDr. Mirko Kämpf
Event processing, time series aggregation and analysis, and finally analysis of structural patterns between those data snippets can all be done on Hadoop clusters on huge data volumes.
In order to find hidden relations and invisible structures, one has to combine three disciplines using a variety of tools. Luckily, the Hadoop ecosystem offers many such tools. In this session you will see practical examples and a demonstration of the "Hadoop-Oscilloscope". Generic analysis patterns and recommendations on selecting appropriate algorithms provide additional background.
Information Spread in the Context of Evacuation OptimizationDr. Mirko Kämpf
Abstract: Our evacuation simulation tool utilizes established algorithms for the emotional and intelligence driven motion of human beings in addition to a simple lattice gas simulation. We analyze the spread of information inside a restricted geometry of a real building and compare these results with the data from a simulation in the free space. We apply the DFA and the RIS statistic to our simulation dataset to detect phases or phase transitions of the whole system. We study the impact of communication technology by comparison of different update algorithms and exit strategies. These results help us to define basic functional requirements to the underlying communication technology and network topology as well as to the needed sensors.
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"Dr. Mirko Kämpf
Since the numbers of hypertext pages and hyperlinks in the WWW have been continuously growing for more than 20 years, the problem of finding relevant content has become increasingly important. We have developed and evaluated techniques for a time-dependent characterization of the global and local relevance of WWW pages based on document length, number of links, and cross-correlations in user-access time series. We focus on content and user activity in selected groups of Wikipedia articles as a first application, mainly because of data availability. Our goal is the assignment of ranking values to a hypertext page (node). The values shall cover static properties of the node and its neighbourhood (context) as well as dynamic properties derived from its page-view rates that depend on underlying communication processes. We show in several examples how this goal can be achieved.
Seminar of U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible (UV-Vis) spectroscopy refers to absorption or reflectance spectroscopy in the UV-Vis spectral region.
It is an analytical method that measures the amount of light absorbed by the analyte.
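The quantity measured in UV-Vis absorption spectroscopy is conventionally expressed through the Beer-Lambert law, A = log10(I0/I) = ε·c·l; a minimal numeric sketch:

```python
import math

def absorbance(intensity_in, intensity_out):
    # Beer-Lambert law: A = log10(I0 / I), where I0 is the incident and
    # I the transmitted intensity. A also equals epsilon * c * l
    # (molar absorptivity * concentration * path length).
    return math.log10(intensity_in / intensity_out)

# If 10% of the incident light is transmitted, the absorbance is 1.0
print(absorbance(100.0, 10.0))  # 1.0
```

Given a known molar absorptivity and path length, the same relation is inverted to read the analyte concentration off the measured absorbance.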
Richard's adventures in two entangled wonderlandsRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Introduction:
RNA interference (RNAi), or post-transcriptional gene silencing (PTGS), is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals, and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
microRNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993, Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, which was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that cause the silencing, via RNA-RNA interactions.
Types of RNAi (non-coding RNA):
miRNA: 23-25 nt long; trans-acting; binds the target mRNA with mismatches; causes translation inhibition.
siRNA: 21 nt long; cis-acting; binds the target mRNA through a perfectly complementary sequence.
piRNA (Piwi-interacting RNA): 25-36 nt long; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
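The three steps above can be caricatured in a few lines of code (a toy sketch with made-up sequences; real Dicer processing and RISC loading are far more involved):

```python
def complement(base):
    # Watson-Crick pairing for RNA bases
    return {"A": "U", "U": "A", "G": "C", "C": "G"}[base]

def dicer(dsrna, size=21):
    # Step 1: Dicer cuts the long dsRNA into short siRNA-sized pieces.
    return [dsrna[i:i + size] for i in range(0, len(dsrna) - size + 1, size)]

def risc_silence(mrna, guide):
    # Steps 2-3: RISC keeps one (guide) strand and destroys any mRNA
    # containing the site complementary to it.
    target = "".join(complement(b) for b in reversed(guide))
    return "degraded" if target in mrna else "intact"

# The guide "UAAGC" pairs with the site "GCUUA" inside this mRNA
print(risc_silence("AUGGCUUACGG", "UAAGC"))  # degraded
```

The complementarity test in `risc_silence` is the toy analogue of the sequence-specific pairing that gives RNAi its specificity.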
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex that triggers mRNA degradation in response to siRNA.
Unwinding of the double-stranded siRNA is carried out by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease that cleaves the target mRNA.
DICER: endonuclease (RNase III family)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille) domain: recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis) domain: breaks the phosphodiester bond of the mRNA (RNase H activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they play a key role in regulating gene expression.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
These systems monitor common gases, weather parameters, and particulates.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool used to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been achieved using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed-tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insight into the progression of disease, the response to treatments, and developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, vascularization, and tumor metastasis in exceptional detail. The webinar will also give an overview of IVM in drug development, offering a view into the intricate interactions between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic interventions in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.