In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives such as Hadoop, Spark, and Storm. Spark, however, is unique in providing both batch and streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.
This document summarizes SparkR, which allows running Spark programs from R. It discusses the motivation, history and creators of SparkR. It then explains the SparkR architecture and how to use SparkR, including running SparkR scripts and programs, accessing the SparkR API, converting data formats, and performing descriptive and predictive analytics. Finally, it demonstrates using SparkR for a credit card fraud detection example, including collecting and sampling data before building a logistic regression model to classify transactions as fraudulent or not.
Spark is going to replace Apache Hadoop! Know Why? - Edureka!
The document discusses how Spark is emerging to replace Hadoop for big data processing. It notes that Hadoop MapReduce is limited to batch processing and is not fast enough for real-time processing needs. In contrast, Spark is up to 100 times faster than Hadoop MapReduce, supports both batch and real-time processing, and stores data in memory for faster analysis. A survey is cited showing increasing adoption of Spark over Hadoop in industries handling large volumes of data. The document concludes that while Hadoop will still be used, Spark will replace Hadoop MapReduce as the primary framework for big data applications due to its ability to support real-time processing demands.
This document summarizes Taboola's use of Spark and Cassandra for real-time data analysis. Taboola uses a dedicated Spark and Cassandra cluster consisting of over 5000 cores and 1PB of storage across two data centers to process 5TB of incoming data daily. Taboola loads data from Cassandra into Spark DataFrames using custom classes and analyzes the data in real-time for recommendations, reports, algorithms, and analytics. Zeppelin notebooks are used to execute Spark jobs on this data for further analysis and algorithm development.
Databricks Meetup @ Los Angeles Apache Spark User Group - Paco Nathan
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
Deploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator - Databricks
Using a live coding demonstration, attendees will learn how to deploy Scala Spark jobs onto any Kubernetes environment using Helm, and how to make their deployments more scalable with less need for custom configuration, resulting in boilerplate-free, highly flexible, and stress-free deployments.
This document introduces Silverlight 4 and provides an overview of its new features. Silverlight 4 includes enhancements for media, business applications, and extending functionality beyond the browser. It also improves performance and development tools. The document recommends resources for learning more about Silverlight 4 and other Microsoft technologies.
Interested in how Spark can fit into a data scientist's toolbox? Let Rajiv share his knowledge of using Spark alongside other tools. This talk is from someone who lives in R/Python for most of the day. He will provide an introduction and a comparison of how Spark fits in and when it's best used. Along the way he will share his thinking on various vendors, languages, and algorithms in Spark. As part of the presentation, Rajiv will share some simple examples of using Spark Streaming, recommenders, as well as other Java apps (e.g., H2O) with Spark. By the end, you will have one perspective on when it's best to use Spark, how to use it, and where it is going.
Slide deck from our 2013 SANDCamp presentation. More of the content was likely captured in the conversation, as we used this deck as a jumping-off point for the chat, but there are still some worthwhile concepts in there.
The document discusses a meetup about integrating Concourse and Spinnaker. It covers why Spinnaker is useful for continuous delivery, specifically blue/green deployments, rollbacks, and automated canary analysis. It then discusses how Concourse and Spinnaker can be integrated using the Concourse Spinnaker resource to trigger Spinnaker pipelines from Concourse and vice versa. A demo is shown of building a Docker image, deploying it to Spinnaker, running tests with JMeter, and rolling back if tests fail.
JEEConf 2015 - Introduction to real-time big data with Apache Spark - Taras Matyashovsky
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Presented at JEEConf 2015 in Kyiv.
Design by Yarko Filevych: http://www.filevych.com/
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko... - Databricks
Spark has established itself as the most popular platform for advanced scale-out analytical applications. It is deeply integrated with the Hadoop ecosystem, offers a set of powerful libraries and supports both Python and R. Because of these reasons Data Scientists have started to adopt Spark to train and deploy their models. When Spark 1.4 was released back in 2015, it included the new SparkR library: this API gave R users the exciting new option to run R code on Spark.
And while the initial promise to provide a full R environment in Spark has been kept, it takes a deeper understanding of SparkR's inner workings to make optimal use of its capabilities. This talk will give a comprehensive update on where we stand with Data Science applications in R based on the latest Spark releases. We will share insights from both a startup solution and a Fortune 100 company where SparkR does Machine Learning in the Cloud on a scale that would not have been feasible previously: its parallel execution model runs in minutes and hours whereas conventional sequential approaches would take days and months.
Suggested Topics:
• An update on the SparkR architecture in the latest Spark release: using R with SparkSQL, MLlib and Spark’s Structured Streaming
• How to handle practical challenges, e.g. running R on the cluster without a local installation, storing non-tabular results, such as Data Science models or plots, mixing Scala and R.
• Scaling Big Compute Applications with SparkR: Parallelizing SparkR applications with User-Defined Functions (UDFs) and elastic scaling of resources in the Cloud
• An Outlook on Machine Learning with SparkR and its ecosystem, frameworks and tools.
• Plus: “Do I need to learn Python?”
This document discusses 5 reasons why Apache Spark is in high demand: 1) Low latency processing by keeping data in memory, 2) Support for streaming data through resilient distributed datasets (RDDs), 3) Integration of machine learning and graph processing libraries, 4) DataFrame API for easier data analysis, and 5) Ability to integrate with Hadoop for large scale data processing. It provides details on Spark's architecture and benchmarks showing its faster performance compared to Hadoop for tasks like sorting large datasets.
End-to-End Data Pipelines with Apache Spark - Burak Yavuz
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
The document discusses a company's migration from their in-house computation engine to Apache Spark. It describes five key issues encountered during the migration process: 1) difficulty adapting to Spark's low-level RDD API, 2) limitations of DataSource predicates, 3) incomplete Spark SQL functionality, 4) performance issues with round trips between Spark and other systems, and 5) OutOfMemory errors due to large result sizes. Lessons learned include being aware of new Spark features and data formats, and designing architectures and data structures to minimize data movement between systems.
Reducing Pager Fatigue Using a Serverless ML Bot - Mike Fowler
Being woken up at 3 am by the pager is never fun but seeing an incident resolve before you’ve even left the bed is maddening. Sleepily the next day you tune the alert for a better night’s sleep yet more untuned alerts sing to you in your sleep. After a few rounds of alert-tuning whack-a-mole you wonder: Could I predict if an incident will resolve itself?
This is the story of how a weary engineer used a Cloud ML model with Cloud Functions to reduce pager noise. Recounting some of the challenges faced, we’ll explore training a model with a limited data set & continual training in a serverless environment. We’ll also explore the implications of using a bot as a first responder to a pager.
Timothy Spann, a Principal DataFlow Field Engineer at Cloudera, gave a presentation about Apache Flink SQL for continuous SQL/ETL/applications and Apache NiFi for DevOps. The presentation included demos of building real-time streaming pipelines with Flink and using the NiFi CLI, REST API, and NiPyAPI for NiFi DevOps. Upcoming events were also announced.
Several techniques are available nowadays for scaling data analytics in R. In this presentation we are going to introduce Apache Spark, a general engine for large-scale distributed data processing. We will show how the sparklyr package can be used as a dplyr backend to leverage the resources in a Spark cluster. We will also introduce sparkbq, a sparklyr extension package providing support for Google BigQuery.
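As a minimal sketch of the sparklyr-as-dplyr-backend idea described above (assuming a local Spark installation, e.g. via `sparklyr::spark_install()`, and the sparklyr and dplyr packages; the data and column names come from R's built-in mtcars data set, not from the talk):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy a local data frame into Spark as a remote table
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and run on the cluster;
# collect() brings the (small) aggregated result back into R
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

spark_disconnect(sc)
```

The same pipeline would scale to a multi-node cluster simply by changing the `master` argument; the dplyr code itself stays unchanged.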
Facilitating Possibility: Appreciative Inquiry as a Tool for Content Strategy - Katherine Krause
So much of content strategy work depends upon our ability to lead and affect change with an organization. Appreciative Inquiry is a positive orientation to change management that I have found to be especially helpful in the work of enterprise content strategy. Here's an introduction to the AI approach with some application to the world of content development and governance.
Sydney Apache Spark Meetup - Spark Natural Language Processing - Andy Huang
In this talk, we shared our experience in using Spark to perform natural language processing tasks to drive business value for our clients. We demonstrated the capabilities of word embedding using Spark Word2Vec followed by showing how third party natural language models can be incorporated into Spark applications.
Spark can process data faster than Hadoop by keeping data in-memory as much as possible to avoid disk I/O. It supports streaming data, machine learning algorithms, graph processing, and SQL queries on structured data using its DataFrame API. Spark can integrate with Hadoop by running on YARN and accessing data from HDFS. The key capabilities discussed include low latency processing, streaming, machine learning, graph processing, DataFrames, and Hadoop integration.
In this month's podcast I discuss some recent news about ebooks and DRM. There's information about smartphone uses, from Pew Internet, and a quick debate about mobile websites versus apps. FourSquare and geosocial services are explained, in brief. A good portion of the show describes SWON's new partnership with Hive13, a hacker/maker space in Cincinnati. What is that? Listen in to find out.
The document discusses how libraries are designing their search boxes and references several articles on the topic. It lists different library catalog and discovery systems like Summon, Sierra, Libris, and 360 Link. It also mentions 'Bento-style' search interfaces and discusses metadata and APIs used in library search systems.
This document discusses serverless computing and functions as a service (FaaS). It begins by defining serverless and FaaS, noting that FaaS is a means to achieve serverless architectures. It then charts the evolution of infrastructure models from on-premise hardware to platform as a service (PaaS) and container as a service (CaaS). The document examines the financial model of FaaS, how it incentivizes modularization. It provides examples of pricing for AWS Lambda. It also outlines some common serverless platforms, use cases, considerations for the future of serverless, and concludes with contact information for the author.
Optimizing your SparkML pipelines using the latest features in Spark 2.3 - DataWorks Summit
The document discusses optimizing Spark machine learning pipelines. It describes using parallel model evaluation to speed up hyperparameter tuning by training multiple models simultaneously. This reduces the time spent on cross-validation for hyperparameter selection. The document also discusses optimizing tuning for pipeline models by treating the pipeline as a directed acyclic graph and parallelizing the fitting in breadth-first order to avoid duplicating work where possible.
Similar to Shiny Spark (閃亮的火花) (20171227 - Spark.TW 3rd Anniversary Sharing)
Natural Language Processing (NLP), RAG and its applications.pptx - fkyes25
In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
11. / 25
about shiny…
• a new framework; let's wait and see… (原文: 新的 framework,觀望一下...)
• Google Trends
• https://trends.google.com.tw/trends/explore?date=all&q=R%20shiny,d3.js
• production
• https://shiny.rstudio.com/gallery/see-more.html
12. / 25
about shiny…
• like MVC
• demo: http://13.115.248.137:3838/kmeans/
model ⬌ function.R
view ⬌ ui.R
control ⬌ server.R
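The MVC-style split on this slide can be sketched as a single-file Shiny app (a hypothetical k-means example inspired by the demo URL, not the actual demo code; in a real project the three parts would typically live in function.R, ui.R, and server.R as the slide indicates):

```r
library(shiny)

# "model" (function.R): plain R logic, independent of the UI
run_kmeans <- function(data, k) {
  kmeans(data, centers = k)
}

# "view" (ui.R): the page layout and inputs/outputs
ui <- fluidPage(
  numericInput("k", "Number of clusters", value = 3, min = 1, max = 9),
  plotOutput("plot")
)

# "control" (server.R): reacts to inputs, calls the model, renders the view
server <- function(input, output) {
  output$plot <- renderPlot({
    fit <- run_kmeans(iris[, 1:2], input$k)
    plot(iris[, 1:2], col = fit$cluster, pch = 19)
  })
}

shinyApp(ui, server)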
13. / 25
shiny vs. tableau

        shiny                                   tableau
pros    • free                                  • easy to production
        • data science packages / algorithms    • easy to maintain
        • scale out
cons    • hard to production                    • not free
        • hard to maintain                      • hard to import packages / algorithms
                                                • hard to scale out