Scalable AutoML for Time Series Forecasting using Ray | Databricks
Time series forecasting is widely used in real-world applications, such as network quality analysis in telcos, log analysis for data center operations, and predictive maintenance for high-value equipment.
SnapLogic: iPaaS (Elastic Integration Cloud and Data Integration) | Surendar S
This document provides useful and meaningful concepts about SnapLogic, and will be especially helpful for beginner- and intermediate-level SnapLogic learners.
How to Define and Share your Event APIs using AsyncAPI and Event API Products... | Hosted by Confluent
Defining asynchronous APIs and sharing them with your developer community is the most effective way for internal app developers and partners to create new services using real-time event streams. But how do you do it? What specification do you use to define the APIs? What are the best practices for sharing them with the developer community? What framework can you use to code? And what's next? How do you manage the lifecycle of these APIs? In this talk, Fran Mendez, founder of AsyncAPI, and Jonathan Schabowsky, Solace CTO Architect, will introduce you to the AsyncAPI specification and show you two different methods to define and share your event APIs, quickly get up to speed, and more. You will learn how to create a Kafka application using asynchronous APIs in minutes!
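An AsyncAPI document is just structured data. As a hedged illustration (the channel name and payload fields below are made up for this sketch, not taken from the talk), here is the minimal shape of an AsyncAPI 2.x document built as a plain Python dict:

```python
import json

# Minimal shape of an AsyncAPI 2.x document, built as a plain dict.
# The channel name and payload fields are illustrative, not from the talk.
spec = {
    "asyncapi": "2.0.0",
    "info": {"title": "User Signup Events", "version": "1.0.0"},
    "channels": {
        "user/signedup": {
            "subscribe": {  # consumers receive events published on this channel
                "message": {
                    "payload": {
                        "type": "object",
                        "properties": {
                            "userId": {"type": "string"},
                            "signedUpAt": {"type": "string", "format": "date-time"},
                        },
                    }
                }
            }
        }
    },
}

print(json.dumps(spec, indent=2))
```

The same structure is usually written as YAML in practice; tooling in the AsyncAPI ecosystem can generate docs and code stubs from it.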
Engineering products for scale, speed and agility | Atul Narkhede
How can you ensure product scalability and performance while racing to meet market needs?
Software product development companies today work in a high-speed, dynamic and challenging environment. Starting with an idea, you need to build a Minimum Viable Product that you can take to the market for feedback, then incorporate user feedback, while still being ready to launch before the competition. In this situation, how can you ensure that your products are reliable, scalable and secure? The secret is in following the best practices of product engineering.
Watch the audio-visual recording of this talk at http://bit.ly/UMaCEq
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr... | DataStax Academy
In this in-depth workshop you will gain hands-on experience using Spark and Cassandra inside the DataStax Enterprise Platform. The focus of the workshop will be working through data analytics exercises to understand the major developer considerations. You will also gain an understanding of the internals behind the integration that allow for large-scale data loading and analysis. It will also review some of the major machine learning libraries in Spark as an example of data analysis.
The workshop will start with a review of the basics of how Spark and Cassandra are integrated. Then we will work through a series of exercises that show how to perform large-scale data analytics with Spark and Cassandra. A major part of the workshop will be understanding effective data modeling techniques in Cassandra that allow for fast parallel loading of the data into Spark to perform large-scale analytics on that data. The exercises will also look at how to use the open source Spark Notebook to run interactive data analytics with the DataStax Enterprise Platform.
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring | Databricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
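The actual Spark Listener interface lives on the JVM (`org.apache.spark.scheduler.SparkListener` in Scala/Java); the class and event names below are illustrative Python, not the real Spark API. They sketch the callback pattern the talk describes: the scheduler posts events to a listener bus, and every registered listener gets a callback it can use for monitoring.

```python
# Schematic of the listener pattern that Spark Listeners follow.
# All names here are hypothetical stand-ins for the JVM-side Spark API.

class TaskEndEvent:
    """A minimal stand-in for a task-completion event."""
    def __init__(self, task_id, duration_ms):
        self.task_id = task_id
        self.duration_ms = duration_ms

class MetricsListener:
    """Accumulates task runtimes as events arrive, like a monitoring listener."""
    def __init__(self):
        self.total_ms = 0
        self.task_count = 0

    def on_task_end(self, event):
        self.total_ms += event.duration_ms
        self.task_count += 1

class ListenerBus:
    """The scheduler posts events; every registered listener gets a callback."""
    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def post_task_end(self, event):
        for listener in self.listeners:
            listener.on_task_end(event)

bus = ListenerBus()
metrics = MetricsListener()
bus.add_listener(metrics)
for task_id, ms in enumerate([120, 80, 200]):
    bus.post_task_end(TaskEndEvent(task_id, ms))
print(metrics.task_count, metrics.total_ms)  # 3 400
```

In real Spark you would subclass `SparkListener`, override the callbacks you care about, and register the instance with the SparkContext; the flow of events to callbacks is the same.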
Slide deck from the presentation given to the Azure Singapore user group on Monitoring Kubernetes with Prometheus and Grafana, 19 August 2021.
It covers Prometheus architecture, installation using the Prometheus Operator, Service Monitors, Pod Monitors, and alert rules. The live demo included Prometheus and Grafana integrations for Spring Boot and .NET Core applications. Monitoring for infrastructure and messaging platforms using RabbitMQ is also covered.
YouTube video recording - https://youtu.be/t8uenUoI4Mw
https://www.meetup.com/en-AU/mssgug/events/279925499
Server-Sent Events using Reactive Kafka and Spring WebFlux | Gagan Solur Ven... | Hosted by Confluent
Server-Sent Events (SSE) is a server push technology in which clients receive automatic server updates over a secure HTTP connection. SSE suits apps such as live stock updates that use one-way data communication, and it also helps replace long polling by maintaining a single connection and keeping a continuous event stream going through it. We used a simple Kafka producer to publish messages onto Kafka topics, and developed a reactive Kafka consumer by leveraging Spring WebFlux to read data from a Kafka topic in a non-blocking manner and send data to clients registered with the Kafka consumer without closing any HTTP connections. This implementation lets us send data in a fully asynchronous, non-blocking manner and handle a massive number of concurrent connections. We'll cover:
• Push data to external or internal apps in near real time
• Push data to files and securely copy them to any cloud service
• Handle multiple third-party app integrations
WSO2Con ASIA 2016: API Driven Innovation Within the Enterprise | WSO2
85% of enterprises have a digital transformation strategy in place, but only 30% have really executed on it. What about you? In this session, we will explore how enterprises can embark on their digital transformation journey, leveraging APIs and a service-based architecture. Isabelle shares how several customers have achieved their business goals and describes the technical approach they took to do so.
Michal Malohlava's presentation on Building Your Own Recommendation Engine 03.17.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Evolving the Engineering Culture to Manage Kafka as a Service | Kate Agnew, O... | Hosted by Confluent
Embracing open source software for critical platform operations is a tough organizational evolution for a company of any size. This is particularly daunting for technology teams accustomed to a fully supported managed service. Come learn about how we are using OSS to modernize Health Care at UnitedHealth Group as a roadmap to adopt and offer OSS in your own organization!
Over the last three years, Kafka as a Service within UnitedHealth Group has gone from non-existent to being centrally managed and utilized by over 200 internal application teams as an essential component to our ecosystem. In this session, I will share how to tactically implement a Kafka as a Service platform offering within any organization with a very lean team and how to get broad adoption from engineers and leadership.
I'll discuss the engineering cultural changes needed, both on the DevOps team as well as more broadly, to adopt OSS. Spoiler: Documentation is the key to success. I will talk about some of our "aha" moments, including the importance of internal Terms of Service and how to encourage teams to "Google first." I will include things that haven't worked as well, such as requiring manual review of all topic creation PRs (this doesn't scale!).
Attendees will learn how to both stand up their own OSS offering as well as how to be a good internal consumer of other such offerings. Come ready to learn and laugh about my journey to offering OSS to thousands of people!
Google Charts is a JavaScript API for quickly creating beautiful charts and graphs that are powerful, simple to use, and best of all free. This talk explores how you can incorporate Google Charts into your Android apps using a WebView and very little code.
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr... | Databricks
This talk is a case study of how Apache Spark and the Spark-Solr library are used at Flipp to drive search relevancy. Flipp is a Toronto-based digital flyer and ecommerce company that helps shoppers save money on weekly shopping. Our customers can browse through our 5+ million products from brick-and-mortar retailers across North America, which makes search a very challenging function in our app. How do you show the most relevant, personalized search results for a query?
The talk will focus on using user signals such as Click-Through Rate (CTR) and impressions to increase search relevancy. I will also talk about how PySpark is used to create the Flipp Search ETL platform for collecting user signals and reading product data from Solr. I will explain the problem scenario in which keyword search and basic relevancy algorithms become ineffective when dealing with a large product database. The solutions will cover the following implementations being used at Flipp to drive relevancy:
– Utilizing user clicks and popularity data to derive and index normalized item weights, implementing the Search Crowd Curation models in Apache Solr.
– How around 5+ million items are classified into Google categories in real time using Keras and Apache Spark to power product category curation in Solr.
– How to create a crowd-sourced query intent categorizer in Solr using the Spark-Solr library.
– The use of offline and online metrics at Flipp for evaluating changes in search relevancy.
– Future plans for incorporating Kafka Connect with Structured Streaming to perform real-time product indexing in Solr with the Spark-Solr library.
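The abstract doesn't spell out the Crowd Curation math. As one hedged sketch of the general idea, click and impression counts can be smoothed into a CTR and normalized into index-time item weights; the priors and max-normalization below are assumptions for illustration, not Flipp's actual model:

```python
def normalized_item_weights(signals, prior_clicks=1.0, prior_impressions=20.0):
    """signals: {item_id: (clicks, impressions)}.
    Returns CTR-based weights in (0, 1], smoothed with pseudo-counts so
    low-traffic items don't dominate, then normalized by the max smoothed CTR."""
    smoothed_ctr = {
        item: (clicks + prior_clicks) / (impressions + prior_impressions)
        for item, (clicks, impressions) in signals.items()
    }
    top = max(smoothed_ctr.values())
    return {item: ctr / top for item, ctr in smoothed_ctr.items()}

weights = normalized_item_weights({
    "sku-1": (50, 1000),   # popular, well-sampled item
    "sku-2": (2, 10),      # tiny sample: smoothing tempers its raw 20% CTR
    "sku-3": (0, 500),     # shown a lot, never clicked
})
```

Weights like these could then be indexed per item and used as a boost factor at query time.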
Google Cloud Bangla session. I gave a talk all about Google Firebase, its features, and its technical benefits. After the session, I ran a small workshop so that people could get real-time hands-on experience.
This post talks about the various architectural decisions, and the reasons driving them, that were taken to build a REST API that needs to deliver large amounts of reporting data.
End-to-End Data Pipelines with Apache Spark | Burak Yavuz
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
Stateful Microservices with Apache Kafka and Spring Cloud Stream with Jan Svo... | Hosted by Confluent
You have been building your applications with stateless microservices. You might even be a rockstar using Kafka for inter service communication. Everything works wonderfully but you feel you could do something more. You want your microservices to have a state.
Developing stateful microservices can be hard. I will share my experience with building stateful applications with Kafka and Spring Cloud Stream libraries.
Kafka Streams State Stores and Interactive Queries are the main building blocks. They are used by stream processing applications to store and query data. They can scale and be fault tolerant together with your application instances in your container platform. But there are some limitations and we need to know how to monitor their performance.
This session is targeted for developers who are interested in learning event streaming practices. Demo application code will be available to participants.
Databricks Meetup @ Los Angeles Apache Spark User Group | Paco Nathan
Los Angeles Apache Spark Users Group 2014-12-11 http://meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
A look ahead at Spark Streaming in Spark 1.2 and beyond, with case studies, demos, plus an overview of approximation algorithms that are useful for real-time analytics.
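The summary above doesn't list which approximation algorithms the talk covers. A classic example of the genre used in real-time analytics is reservoir sampling, which keeps a fixed-size uniform random sample of an unbounded stream in O(k) memory:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Keep a uniform random sample of k items from a stream of unknown
    length using O(k) memory (Algorithm R). Each item ends up in the
    sample with probability k/n."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(10_000), 5)
```

Sketches in the same family (Count-Min Sketch, HyperLogLog, t-digest) trade exactness for bounded memory the same way, which is what makes them practical inside a streaming job.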
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana... | Lillian Pierson
In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances that it has spurred. You'll discover the interesting story of its academic origins and then get an overview of the organizations using the technology. After being briefed on some impressive Spark case studies, you'll come to know of the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the tremendous impact that learning Spark can have upon your current salary, and the best ways to get trained in this ground-breaking new technology.
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform | Yao Yao
Yao Yao, Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing, with libraries for SQL, streaming, and advanced analytics.
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Any startup has to have a clear go-to-market strategy from the beginning. Similarly, any data science project has to have a go-to-production strategy from its first days, so it could go beyond proof-of-concept. Machine learning and artificial intelligence in production would result in hundreds of training pipelines and machine learning models that are continuously revised by teams of data scientists and seamlessly connected with web applications for tenants and users.
In this demo-based talk we will walk through the best practices for simplifying machine learning operations across the enterprise and providing a serverless abstraction for data scientists and data engineers, so they could train, deploy and monitor machine learning models faster and with better quality.
Spark Development Lifecycle at Workday - ApacheCon 2020 | Pavel Hardak
Presented by Eren Avsarogullari and Pavel Hardak (ApacheCon 2020)
https://www.linkedin.com/in/erenavsarogullari/
https://www.linkedin.com/in/pavelhardak/
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use cases such as data ingestion, preparation (cleaning, transformation & publishing) and discovery. At Workday, we extend the Spark OSS repo and build custom Spark releases layering our custom patches on top of the Spark OSS patches. Custom Spark release development introduces challenges when supporting multiple Spark versions against a single repo and dealing with large numbers of customers, each of which can execute their own long-running Spark applications. When building custom Spark releases and new Spark features, a dedicated benchmark pipeline is also important to catch performance regressions, by running the standard TPC-H & TPC-DS queries against both Spark versions and monitoring the Spark driver's and executors' runtime behavior before production. At the deployment phase, we also follow a progressive roll-out plan leveraging Feature Toggles to enable or disable new Spark features at runtime. As part of our development lifecycle, Feature Toggles help with various use cases, such as selecting Spark compile-time and runtime versions, running test pipelines against both Spark versions on the build pipeline, and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark applications. On the other hand, the operation-level runtime behavior of executed Spark queries is important for debugging and troubleshooting. The upcoming Spark release introduces a new SQL REST API exposing executed queries' operation-level runtime metrics, and we transform them into queryable Hive tables in order to track operation-level runtime behavior per executed query. In light of this, this session covers the Spark feature development lifecycle at Workday: the custom Spark upgrade model, the benchmark & monitoring pipeline, and the Spark runtime metrics pipeline, through the patterns and technologies used, step by step.
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020 | Eren Avşaroğulları
Presented by Pavel Hardak and Eren Avsarogullari (ApacheCon 2020)
https://www.linkedin.com/in/pavelhardak/
https://www.linkedin.com/in/erenavsarogullari/
Building Data Products with BigQuery for PPC and SEO (SMX 2022) | Christopher Gutknecht
In this data management session, Christopher describes how to build robust and reliable data products in BigQuery and dbt for PPC and SEO use cases. After an introduction to the modern data stack, six principles of reliable data products are presented, followed by these use cases:
- Google Ads Conversion upload
- SEO sitemap efficiency report
- Google Shopping product rating sync
- Large-Scale link checker with advertools
- Inventory-based PPC campaigns with dbt
Here is the referenced selection of gists on GitHub: https://gist.github.com/ChrisGutknecht
Cherokee Nation 2-day AIAD & DIAD - App in a Day and Dashboard in a Day | Vishal Pawar
Power Apps: A software-as-a-service application platform that enables power users in line-of-business roles to easily build and deploy custom business apps. You will learn how to build Canvas and Model-driven styles of apps.
Common Data Service (CDS): Makes it easier to bring your data together and quickly create powerful apps using a compliant and scalable data service and app platform that's integrated into Power Apps.
Power Automate: A business service for line-of-business specialists and IT pros to build automated workflows intuitively.
Power BI: Self-service business intelligence capabilities, where end users can create reports and dashboards by themselves, without having to depend on information technology staff or database administrators.
.NET development on Linux: see why Microsoft loves Linux and Open Source | Rodrigo Kono
This session is an overview of Microsoft's approach to Linux and open source, including the software development landscape and the benefits for you. You will learn about Microsoft's work with Linux and open source, both on-premises and in the cloud with Azure. You will also learn how to develop with .NET technology, using C# on Linux and running independently of Windows Server.
SPSNYC2019 - What is Common Data Model and how to use it? | Nicolas Georgeault
Are you using PowerApps? Not yet or maybe just the Canvas option? All you need to know about the CDS Database, the way to deploy it and the way to use it to modernize your business applications using both Canvas and Model-Driven Apps.
KNOWAGE evolution in 2022 mainly focuses on: a new data preparation module and data federation in the self-service process, augmented analytics to support every end-user touch point and provide automatic insights, usability and performance improvements for a new effective UI, and a core offering as a SaaS ABI solution.
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma | Spark Summit
Learn about the big data processing ecosystem at Netflix and how Apache Spark sits in this platform. I talk about typical data flows and data pipeline architectures used at Netflix, and address how Spark is helping us gain efficiency in our processes. As a bonus, I'll touch on some unconventional use cases, contrary to typical warehousing/analytics solutions, that are being served by Apache Spark.
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features, and to see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Taboola's experience with Apache Spark (presentation @ Reversim 2014) (tsliwowicz)
At Taboola we get a constant feed of data (many billions of user events a day) and use Apache Spark together with Cassandra for both real-time stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source project: a Hadoop-compatible computing engine that makes big data analysis drastically faster, through in-memory computing, and simpler to write, through easy APIs in Java, Scala and Python. The project was born as part of PhD work in UC Berkeley's AMPLab (part of the BDAS stack, pronounced "Bad Ass") and turned into an incubating Apache project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already has large production clusters of Spark on YARN.
Spark can run as a standalone cluster, under Apache Mesos with ZooKeeper, or on YARN, and can run side by side with Hadoop/Hive on the same data.
One of the biggest benefits of Spark is that the API is very simple and the same analytics code can be used for both streaming data and offline data processing.
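The "same code for streaming and offline" point can be sketched without Spark itself: write one transformation function, then apply it both to a full dataset (offline) and to micro-batches whose partial results are merged (streaming). This is a minimal pure-Python sketch of the idea — the event log and `count_clicks` helper are hypothetical, not Spark's actual RDD/DStream API:

```python
from collections import Counter

# Hypothetical analytics logic, written exactly once.
def count_clicks(events):
    """Count click events per user; 'events' is any iterable of (user, action)."""
    return Counter(user for user, action in events if action == "click")

log = [("u1", "click"), ("u2", "view"), ("u1", "click"), ("u3", "click")]

# Offline: apply to the full historical dataset at once.
offline = count_clicks(log)

# Streaming: apply the SAME function to micro-batches and merge the partials,
# mirroring how Spark reuses one transformation for batch and streaming data.
streaming = Counter()
for batch in [log[:2], log[2:]]:
    streaming += count_clicks(batch)

assert offline == streaming  # identical logic, identical result
```

The design point is that the per-batch logic never knows whether it is running over history or over a live stream; only the driver loop differs.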
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
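The report benchmarks these reductions in OpenMP and CUDA; the split-reduce-combine shape they all share can be sketched in plain Python. The chunking strategy and helper names below are illustrative assumptions, and threads are used only to show the structure of a parallel reduction, not to reproduce the measured speedups:

```python
from concurrent.futures import ThreadPoolExecutor

def sequential_sum(xs):
    """Baseline: a plain sequential reduction over the vector."""
    total = 0.0
    for x in xs:
        total += x
    return total

def parallel_sum(xs, workers=4):
    """Split into contiguous chunks, reduce each chunk independently,
    then combine the partial sums -- the same tree-reduction shape an
    OpenMP or CUDA reduction uses."""
    chunk = (len(xs) + workers - 1) // workers
    parts = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sequential_sum, parts))

data = [0.5] * 1000
assert sequential_sum(data) == parallel_sum(data) == 500.0
```

The launch-config comparisons in the report correspond to tuning `workers` (threads per block, blocks per grid) in this picture: the reduction result is the same, only the partitioning changes.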
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance data exploration, analysis, and discovery.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies data acquisition with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, so you can focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By combining distributed ledger technology with rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
3. Collaborative Filtering
• Bucketed consumption groups
• Geo: region-based recommendations
• Context: metadata
• Social: Facebook/Twitter API
• User behavior: cookie data
An engine focused on maximizing CTR & post-click engagement.
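As a rough illustration of the collaborative-filtering idea behind such an engine (the interaction log and scoring below are hypothetical, not Taboola's actual algorithm), item-item co-occurrence alone is enough to produce simple recommendations:

```python
from collections import defaultdict

# Hypothetical interaction log: user -> set of items they engaged with.
history = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
}

# Item-item co-occurrence: count items consumed together by the same user.
cooc = defaultdict(lambda: defaultdict(int))
for items in history.values():
    for i in items:
        for j in items:
            if i != j:
                cooc[i][j] += 1

def recommend(user, k=1):
    """Score unseen items by co-occurrence with the user's history."""
    seen = history[user]
    scores = defaultdict(int)
    for i in seen:
        for j, c in cooc[i].items():
            if j not in seen:
                scores[j] += c
    return sorted(scores, key=scores.get, reverse=True)[:k]

# e.g. recommend("u1") -> ["c"], since "c" co-occurs with both "a" and "b"
```

A production engine would blend many more signals (geo, context, social, behavior, as the slide lists) and optimize for CTR and post-click engagement rather than raw co-occurrence, but the scoring skeleton is the same.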
4. Largest Content Discovery and Monetization Network
• 550M monthly unique users
• 240B monthly recommendations
• 10B+ daily user events
• 5TB+ incoming daily data
5. • Using Spark in production since v0.8
• 6 data centers across the globe
• Dedicated Spark & Cassandra (for Spark) cluster consisting of 5000+ cores with 35TB of RAM and ~1PB of local SSD storage, across 2 data centers.
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithm calibration
– Real-time analytics
What Does it Mean?
7. • Spark DataFrames: Simple and Fast Analysis of Structured Data
https://spark-summit.org/2015/events/spark-dataframes-simple-and-fast-analysis-of-structured-data/
DataFrames
10. • From DataFrames to Tungsten: A Peek into Spark's Future
https://spark-summit.org/2015/events/keynote-9/
• Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal
https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/
Tungsten
14. • Spark and Spark Streaming at Netflix
https://spark-summit.org/2015/events/spark-and-spark-streaming-at-netflix/
Interesting Users’ Experience - Netflix
16. • How Spark Fits into Baidu's Scale
https://spark-summit.org/2015/events/keynote-10/
Interesting Users’ Experience - Baidu
18. • Recipes for Running Spark Streaming Applications in Production
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
Databricks Practical Talks – Spark Streaming
36. SparkContext, SQLContext, and ZeppelinContext are automatically created and exposed as the variables 'sc', 'sqlContext', and 'z', respectively, in both the Scala and Python environments.
General Variables In Zeppelin
39. • Connect Zeppelin to the cluster (not standalone)
• Load raw sessions data
• Run code (Python/Scala) for algorithmic analysis
Zeppelin @Taboola - What's next?
Tungsten motivation: single-core CPU speeds have stayed roughly flat for the last ~10 years, so Spark needs to optimize the code it runs rather than wait for faster hardware:
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
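Point (1) can be illustrated with a toy analogy: interpreting an expression tree pays the tree-walking overhead on every row, while generating and compiling the expression once pays it only at compile time. This is a hand-rolled Python sketch of the idea, not Spark's actual bytecode generation:

```python
# A tiny expression "interpreter" vs. generated code, illustrating why
# Tungsten compiles query expressions instead of interpreting them per row.

def interpret(expr, row):
    """Walk a nested tuple tree, e.g. ("add", ("col", 0), ("lit", 5)),
    for every single row -- branch and recursion overhead each time."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "lit":
        return expr[1]
    if op == "add":
        return interpret(expr[1], row) + interpret(expr[2], row)

def codegen(expr):
    """Emit a Python source fragment once, then compile it to a function:
    the tree-walking cost is paid at compile time, not per row."""
    def emit(e):
        if e[0] == "col":
            return f"row[{e[1]}]"
        if e[0] == "lit":
            return repr(e[1])
        return f"({emit(e[1])} + {emit(e[2])})"
    return eval(f"lambda row: {emit(expr)}")

expr = ("add", ("col", 0), ("lit", 5))
compiled = codegen(expr)
assert interpret(expr, [10]) == compiled([10]) == 15
```

Both paths compute the same result; the generated function simply has no interpretive overhead left in its hot loop, which is the effect Tungsten's runtime code generation targets.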