This talk focuses on Netflix's migration from Oracle to SimpleDB -- a cloud-hosted key-value store -- as part of Netflix's move to the cloud (i.e. AWS). Stay tuned for future talks as Netflix evaluates more technologies, e.g. Cassandra.
Building & Operating High-Fidelity Data Streams - QCon Plus 2021 (Sid Anand)
The world we live in today is fed by data. From self-driving cars and route planning to fraud prevention, to content and network recommendations, to ranking and bidding, our world not only consumes low-latency data streams, it adapts to changing conditions modeled by that data.
While software engineering has settled on best practices for developing and managing both stateless service architectures and database systems, the ecosystem of data infrastructure still presents a greenfield opportunity. To thrive, this field borrows from several disciplines: distributed systems, database systems, operating systems, control systems, and software engineering, to name a few.
Of particular interest to me is the subfield of data streams, specifically how to build high-fidelity nearline data streams as a service within a lean team. To build such systems, human operations is a non-starter: all aspects of operating streaming data pipelines must be automated. Come to this talk to learn how to build such a system soup-to-nuts.
Geospatial pipelines in Apache Spark are difficult because of the diversity of datasets and the challenge of harmonizing on a single dataframe. We have worked over the past year to review different pipeline tools that allow us to quickly combine steps to create new workflows or operate on new datasets. We have reviewed Dagster, Apache Spark MLflow pipelines, Prefect, and our own custom solutions. The talk will go over the pros and cons of each of these solutions and will show an actionable workflow implementation that any geospatial analyst can leverage. We will show how we can leverage a pipeline to run a traditional geospatial hotspot analysis. Interactive mapping within the Databricks platform will be demonstrated.
Stardog is a fast, scalable, lightweight RDF database for complex SPARQL queries. It features OWL 2 reasoning, transactions, a robust security layer, integrity constraint validation via Pellet 3, and world-class support.
Sprache als Werkzeug: DSLs mit Kotlin (JAX 2020) (Frank Scheffler)
Domain-specific languages (DSLs) have always been well suited to expressing complex constructs more compactly and legibly. In doing so, they free the user from maintaining recurring program fragments and narrow the view to the essential content of the underlying domain. While there is no shortage of sensible application areas for DSLs, how they work and how to build them are often unjustly decried as a mystery. Domain-specific languages fundamentally divide into external and internal DSLs. External DSLs define an independent language of their own, e.g. Xtend or SQL, and therefore require their own syntax analysis, validation, and a compiler or interpreter for execution. Internal DSLs are based on host programming languages; their language scope is therefore not closed, but can be extended through the host language's elements. With extension functions and lambdas with receivers, Kotlin offers ideal foundations for building internal DSLs, as the ever-growing number of Kotlin-based DSLs, such as the Kotlin Gradle DSL or the Spring Beans DSL, shows. This talk conveys the basics of building your own DSLs with Kotlin. Using a practice-oriented example, a DSL is built step by step during the talk.
SparkR Under the Hood with Hossein Falaki (Databricks)
SparkR is a new and evolving interface to Apache Spark. It offers a wide range of APIs and capabilities to data scientists and statisticians. Because it is a distributed system with a JVM core, some R users find SparkR errors unfamiliar. In this talk we will show what goes on under the hood when you interact with SparkR. We will look at SparkR architecture, performance bottlenecks, and API semantics. Equipped with those, we will show how some common errors can be eliminated. I will use debugging examples based on our experience with real SparkR use cases.
How We Scaled BERT to Serve 1+ Billion Daily Requests on CPU (Databricks)
Roblox is a global online platform bringing millions of people together through play, with over 37 million daily active users and millions of games on the platform. Machine learning is a key part of our ability to scale important services to our massive community. In this talk, we share our journey of scaling our deep learning text classifiers to process 50k+ requests per second at latencies under 20ms. We will share how we were able to not only make BERT fast enough for our users, but also economical enough to run in production at a manageable cost on CPU. Further details can be found in our blog post below:
https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26
PelletServer wraps a range of semantic technologies -- query, reasoning, machine learning, planning, and constraint solving -- in a RESTful interface and sensible set of defaults & conventions. Even the wily shell programmer can build semweb apps with wget!
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model (Garindra Prahandono)
Sale Stock Engineering, represented by Garindra Prahandono, presented "High-Velocity GraphQL & Lambda-based Software Development Model" at the BandungJS event on May 14th, 2018.
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics (Databricks)
Have you ever hit mysterious random process hangs, performance regressions, or OOM errors that leave barely any useful traces, yet are hard or expensive to reproduce? No matter how tricky the bugs are, they always leave some breadcrumbs along the way.
Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing (Databricks)
eBay has been using an enterprise ADBMS for over a decade, and our team began migrating batch workloads from ADBMS to Spark in 2018. We gathered many experiences and lessons during the whole migration journey (85% automated + 15% manual migration), in which we exposed many unexpected issues and gaps between ADBMS and Spark SQL. We made a lot of decisions to fill those gaps in practice and contributed many fixes to Spark core in order to unblock ourselves. This session should be helpful for many folks, especially data and software engineers planning and executing their own migration work. We will share many very specific issues we encountered and how we resolved or worked around them during the real migration process.
When Apache Spark Meets TiDB with Xiaoyu Ma (Databricks)
During the past 10 years, big-data storage layers have mainly focused on analytical use cases. For analytical workloads, users usually offload data onto a Hadoop cluster and run queries against HDFS files. People struggle with modifications on append-only storage and maintain fragile ETL pipelines.
On the other hand, although Spark SQL has proven to be an effective parallel query processing engine, some tricks common in traditional databases are not available due to the characteristics of the underlying storage. TiSpark sits directly on top of the storage engine of a distributed database (TiDB), extends Spark SQL's planning with its own extensions, and utilizes unique features of the database storage engine to achieve functionality not possible for Spark SQL on HDFS. With TiSpark, users are able to query changing, fresh data in real time.
The takeaways from this talk are twofold:
— How to integrate Spark SQL with a distributed database engine, and the benefits of doing so
— How to leverage Spark SQL's experimental methods to extend its capabilities.
Building a High-Performance Database with Scala, Akka, and Spark (Evan Chan)
Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... (Chris Fregly)
* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Memory Optimization and Reliable Metrics in ML Pipelines at Netflix (Databricks)
Netflix personalizes the experience for each member and this is achieved by several machine learning models. Our team builds infrastructure that powers these machine learning pipelines; primarily using Spark for feature generation and training.
Improving Mobile Payments With Real-time Spark (datamantra)
A talk about a real-world Spark Streaming implementation for improving the mobile payments experience. Presented at the Target data meetup in Bangalore by Madhukara Phatak on 22/08/2015.
This is a talk about Netflix's path to Cassandra. The first few slides may look similar to previous presentations, but they are just to set the context. Most of the content is brand new!
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks (Databricks)
The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software.
Framing the Argument: How to Scale Faster with NoSQL (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and IBM Cloudant
Live Webcast March 24, 2015
Watch the Archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=e8bf62408d47e76c43aa73be08377e41c
Context matters. Perspective matters. Thinking outside the box? That's often the key! While the Structured Query Language remains the lingua franca of data, there are some views of the world that are best rendered with the benefit of NoSQL engines. As usual, that's easier said than done. How can your organization migrate from a structured query language to an unstructured or semi-structured one?
Register for this episode of The Briefing Room to find out! Veteran Analyst Dr. Robin Bloor will provide a detailed assessment of serious considerations when using NoSQL engines in conjunction with SQL. He'll be briefed by Ryan Millay of IBM Cloudant, who will showcase his company's solution, and how it's addressing the more vexing challenges facing today's information managers.
Visit InsideAnalysis.com for more information.
This is a high-level presentation I delivered at BIWA Summit. It's just some high-level thoughts related to today's NoSQL and Hadoop SQL engines (not deeply technical).
Data warehousing is a critical component for analysing and extracting actionable insights from your data. Amazon Redshift allows you to deploy a scalable data warehouse in a matter of minutes and start analysing your data right away using your existing business intelligence tools.
Dynamic DDL: Adding Structure to Streaming IoT Data on the Fly (DataWorks Summit)
At the end of the day, data scientists want one thing: tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh... and there are a bunch more data sources that you need to ingest, and the current providers of data keep changing their structure.
At GoPro, we have massive amounts of heterogeneous data being streamed at us from our consumer devices and applications, and we have developed a concept of "dynamic DDL" to structure our streamed data on the fly using Spark Streaming, Kafka, HBase, Hive, and S3. The idea is simple: add structure (schema) to the data as soon as possible, allow the providers of the data to dictate the structure, and automatically create event-based and state-based tables (DDL) for all data sources so that data scientists can access the data via their lingua franca, SQL, within minutes.
RDBMS to NoSQL: Practical Advice from Successful Migrations (ScyllaDB)
When and how to migrate data from SQL to NoSQL are matters of much debate. It can certainly be a daunting task, but when your SQL systems hit architectural limits or your Aurora expenses skyrocket, it's probably time to consider the move.
See a discussion of how best to migrate data from SQL to NoSQL, and how to get heterogeneous data systems to communicate with each other effectively in real time. Get important architectural considerations, tips and tricks, and several real-world use cases.
From this webinar you will learn:
Key differences between RDBMS and NoSQL, and how to know when it’s time to migrate
How to harness the greatest strengths out of both classes of databases, SQL and NoSQL
Migration techniques proven in the field
Modeling differences between RDBMS and NoSQL
Managing releases in NoSQL vs RDBMS
Scylla features and services that help with migrating from a relational database
UKOUG Tech15 - Deploying Oracle 12c Cloud Control in Maximum Availability Arc... (Zahid Anwar, OCM)
Common Cloud Control deployments can sometimes be exposed to single points of failure. In this presentation we discuss these pitfalls and how deploying Cloud Control within the Maximum Availability Architecture can provide a robust system. Aimed at a technical audience, we dive into providing High Availability and Disaster Recovery for the OMS repository and OMS Web Tier through the use of RAC, Web Tier Clustering, Data Guard, and Storage Replication. We take the audience through the simple but effective steps required for this type of deployment, in addition to the license implications of using Maximum Availability Architecture, including what Oracle gives you for free under a restricted-use license. This presentation is based on a recent project completed by our speaker, Zahid Anwar, in which he provided Maximum Availability Architecture for a Cloud Control deployment monitoring 6 critical X4-2 Eighth Exadata Machines.
Healthcare Claim Reimbursement using Apache Spark (Databricks)
Optum Inc. helps hospitals accurately calculate claim reimbursements and detect underpayments from insurance companies. Optum receives millions of claims per day, which need to be evaluated in less than 8 hours so that the results can be sent back to the hospitals for revenue-recovery purposes.
Jump Start on Apache Spark 2.2 with Databricks (Anyscale)
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
Similar to Netflix's Transition to High-Availability Storage (QCon SF 2010)
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo (Sid Anand)
Slides from "Cloud Native Data Pipelines" talk given @ QCon Tokyo 2016. The slides are in both English and Japanese. Thanks to Kiro Harada (https://jp.linkedin.com/in/haradakiro) for the translation.
LinkedIn Data Infrastructure Slides (Version 2) (Sid Anand)
Learn about Espresso, Databus, and Voldemort. LinkedIn Data Infrastructure Slides (Version 2). This talk was given in NYC on June 20, 2012.
You can download the slides as PPT in order to see the transitions here:
http://bit.ly/LfH6Ru
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes a lot of work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
3. Why Are You Here?
"What I need is an exact list of specific unknown problems we might encounter."
-- anonymous
@r39132 - #netflixcloud 3
4.
5. Motivation
Circa late 2008, Netflix had a single data center
Single-point-of-failure (a.k.a. SPOF)
Approaching limits on cooling, power, space, traffic capacity
Alternatives
Build more data centers
Outsource the majority of our capacity planning and scale out
@r39132 - #netflixcloud 5
6. Motivation
Winner: Outsource the majority of our capacity planning and scale out
Leverage a leading Infrastructure-as-a-Service provider
Amazon Web Services
Footnote: As it has taken us a while (i.e. ~2+ years) to realize our vision of running on the cloud, we needed an interim solution to handle growth
We did build a second data center along the way
We did outgrow it
@r39132 - #netflixcloud 6
7.
8. Cloud Migration Strategy
Components
Applications and Software Infrastructure
Data
Migration Considerations
Security
PII and PCI DSS data stays in our DC; the rest can go to the cloud
Scalability and Availability for Business Success
@r39132 - #netflixcloud 8
9. Cloud Migration Strategy
Scalability and Availability for Business Success
High Growth or High Traffic Growth Data
Video starts, Personalized Video choosing
High Traffic Growth Applications
Same as above
Log Processing
Time-to-market Critical Batch Processing
Video encoding
Not Included
DVD inventory and shipment
We are a streaming company that also ships DVDs
@r39132 - #netflixcloud 9
10. Cloud Migration Strategy
Examples of Data that can be moved
Video-centric data
Critics’ reviews
Metadata
User-video-centric data – some of our largest data sets
User-video queue
Previously streamed and shipped video history
Ratings (i.e. a 5-star rating system)
Video streaming metadata (e.g. streaming bookmarks)
@r39132 - #netflixcloud 10
11.
12. Cloud Migration Strategy
High-level Requirements for our Site
No big-bang migrations
New functionality needs to launch in the cloud when possible
High-level Requirements for our Data
Data needs to migrate before applications
Data needs to be shared between applications running in the cloud and our data center during the transition period
@r39132 - #netflixcloud 12
14. Cloud Migration Strategy
Low-level Requirements for our Data
Pick a (key-value) data store in the cloud
Challenges
Translate RDBMS concepts to KV store concepts
Work around issues specific to the chosen KV store
Create a bi-directional DC-Cloud data replication pipeline
@r39132 - #netflixcloud 14
15.
16. Pick a Data Store in the Cloud
An ideal storage solution should have the following features:
Hosted
Managed Distribution Model
Works in AWS
AP from CAP
Handles a majority of use-cases accessing high-growth, high-traffic data
Specifically, key access by customer id, movie id, or both
@r39132 - #netflixcloud 16
17. Pick a Data Store in the Cloud
We picked SimpleDB and S3
SimpleDB was targeted as the AP equivalent of our RDBMS databases in our Data Center
S3 was used for data sets where item or row data exceeded SimpleDB limits and could be looked up purely by a single key (i.e. does not require secondary indices and complex query semantics)
Video encodes
Streaming device activity logs (i.e. CLOB, BLOB, etc…)
Compression of old Rental History
@r39132 - #netflixcloud 17
18.
19. Technology Overview: SimpleDB
Terminology
SimpleDB       Hash Table                Relational Databases
Domain         Hash Table                Table
Item           Entry                     Row
Item Name      Key                       Mandatory Primary Key
Attribute      Part of the Entry Value   Column
@r39132 - #netflixcloud 19
20. Technology Overview: SimpleDB
@r39132 - #netflixcloud 20
Soccer Players
Key           Value
ab12ocs12v9   First Name = Harold; Last Name = Kewell; Nickname = Wizard of Oz; Teams = Leeds United, Liverpool, Galatasaray
b24h3b3403b   First Name = Pavel; Last Name = Nedved; Nickname = Czech Cannon; Teams = Lazio, Juventus
cc89c9dc892   First Name = Cristiano; Last Name = Ronaldo; Teams = Sporting, Manchester United, Real Madrid
SimpleDB's salient characteristics
• SimpleDB offers a range of consistency options
• SimpleDB domains are sparse and schema-less
• The Key and all Attributes are indexed
• Each item must have a unique Key
• An item contains a set of Attributes
• Each Attribute has a name
• Each Attribute has a set of values
• All data is stored as UTF-8 character strings (i.e. no support for types such as numbers or dates)
21. Technology Overview: SimpleDB
What does the API look like? (a code sketch follows the list)
Manage Domains
CreateDomain
DeleteDomain
ListDomains
DomainMetaData
Access Data
Retrieving Data
GetAttributes – returns a single item
Select – returns multiple items using SQL syntax
Writing Data
PutAttributes – put single item
BatchPutAttributes – put multiple items
Removing Data
DeleteAttributes – delete single item
BatchDeleteAttributes – delete multiple items
@r39132 - #netflixcloud 21
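To make the API surface concrete, here is a minimal sketch using the legacy boto 2 SimpleDB client for Python; the domain name, item names, and attributes are illustrative, not Netflix's actual schema.

import boto

conn = boto.connect_sdb()  # credentials come from the environment/boto config

# Manage domains
domain = conn.create_domain('soccer_players')        # CreateDomain
print([d.name for d in conn.get_all_domains()])      # ListDomains

# Writing data: PutAttributes (single item)
domain.put_attributes('ab12ocs12v9',
                      {'FirstName': 'Harold', 'LastName': 'Kewell',
                       'Teams': ['Leeds United', 'Liverpool', 'Galatasaray']})

# Retrieving data: GetAttributes (single item), Select (multiple items, SQL syntax)
item = domain.get_attributes('ab12ocs12v9')
rows = domain.select("select * from `soccer_players` where LastName = 'Kewell'")

# Removing data: DeleteAttributes (single item)
domain.delete_attributes('ab12ocs12v9')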
22. Technology Overview: SimpleDB
@r39132 - #netflixcloud 22
Options available on reads and writes (a code sketch follows)
Consistent Read
Read the most recently committed write
May have lower throughput / higher latency / lower availability
Conditional Put/Delete
i.e. Optimistic Locking
Useful if you want to build a consistent multi-master data store – you will still require your own anti-entropy
We do not use this currently, so we don't know how it performs
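A sketch of both options with the same boto 2 client (parameter names per boto; treat the domain and attribute values as illustrative):

import boto

conn = boto.connect_sdb()
domain = conn.get_domain('soccer_players')

# Consistent Read: see the most recently committed write, at some cost in
# throughput/latency/availability
item = domain.get_attributes('ab12ocs12v9', consistent_read=True)

# Conditional Put (optimistic locking): the write succeeds only if 'version'
# still holds the value we read; otherwise another writer won the race
conn.put_attributes('soccer_players', 'ab12ocs12v9',
                    {'Nickname': 'Wizard of Oz', 'version': '2'},
                    replace=True,
                    expected_value=['version', '1'])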
23.
24. Translate RDBMS Concepts to Key-Value Store Concepts
Relational Databases are known for relations
First, a quick refresher on Normal forms
@r39132 - #netflixcloud 24
25. Normalization
NF1: All occurrences of a record type must contain the same number of fields - variable repeating fields and groups are not allowed
NF2: Second normal form is violated when a non-key field is a fact about a subset of a key
Violated here:
Part | Warehouse | Quantity | Warehouse-Address
Fixed here:
Part | Warehouse | Quantity   (plus a separate  Warehouse | Warehouse-Address  table)
@r39132 - #netflixcloud 25
26. Normalization
Issues
Wastes Storage
The warehouse address is repeated for every Part-WH pair
Update Performance Suffers
If the address of the warehouse changes, I must update many Part-WH pairs
Data Inconsistencies Possible
I can update the warehouse address for one Part-WH pair and miss Parts for the same WH
Data Loss Possible
If at some point in time there are no parts, the WH address will be lost
@r39132 - #netflixcloud 26
27. Normalization
RDBMS-to-KV-store migrations can't simply accept denormalization!
Especially many-to-many and many-to-one entity relationships
Instead, pick your data set candidates carefully!
Keep relational data in RDBMS
Move key-look-ups to KV stores
Luckily for Netflix, most data is accessed by Customer, Video, or both: i.e. Key Lookups
@r39132 - #netflixcloud 27
28. Translate RDBMS Concepts to Key-Value Store Concepts
Aside from relations, relational databases typically offer the following:
Transactions
Locks
Sequences
Triggers
Clocks
A structured query language (i.e. SQL)
Database server-side coding constructs (i.e. PL/SQL)
Constraints
@r39132 - #netflixcloud 28
29. Translate RDBMS Concepts to Key-Value Store Concepts
Partial or no SQL support. Loosely speaking, SimpleDB supports a subset of SQL
BEST PRACTICE
Do GROUP BY and JOIN operations in the application layer, involving smallish data sets (see the sketch after this slide)
No relations between domains
BEST PRACTICE
Compose relations in the application layer
No transactions
BEST PRACTICE
Use SimpleDB's Optimistic Concurrency Control API: ConditionalPut and ConditionalDelete
@r39132 - #netflixcloud 29
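As a concrete illustration of the first best practice, here is a sketch of a JOIN plus GROUP BY composed in the application layer over two smallish result sets already fetched from separate domains (all names hypothetical):

from collections import defaultdict

queue_items = [{'cust_id': 'c1', 'video_id': 'v1'},
               {'cust_id': 'c1', 'video_id': 'v2'},
               {'cust_id': 'c2', 'video_id': 'v1'}]
videos = [{'video_id': 'v1', 'title': 'Heat'},
          {'video_id': 'v2', 'title': 'Ronin'}]

# JOIN: index one side by the join key, then enrich the other side
by_video_id = {v['video_id']: v for v in videos}
joined = [dict(q, title=by_video_id[q['video_id']]['title']) for q in queue_items]

# GROUP BY cust_id: count queued videos per customer
counts = defaultdict(int)
for row in joined:
    counts[row['cust_id']] += 1
print(dict(counts))  # {'c1': 2, 'c2': 1}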
30. Translate RDBMS Concepts to Key-Value Store Concepts
No schema - this is non-obvious. A query for a misspelled attribute name will not fail with an error
BEST PRACTICE
Implement a schema validator in a common data access layer
No sequences
BEST PRACTICE
Sequences are often used as primary keys
In this case, use a naturally occurring unique key
If no naturally occurring unique key exists, use a UUID (see the sketch below)
Sequences are also often used for ordering
Use a distributed sequence generator
@r39132 - #netflixcloud 30
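A sketch of the key-generation practice above; the helper name is hypothetical:

import uuid

def make_item_name(natural_key=None):
    # Prefer a naturally occurring unique key, e.g. 'CUST123:VID456'
    if natural_key:
        return natural_key
    # Otherwise mint a UUID: unique without any sequence or coordination
    return uuid.uuid4().hex

print(make_item_name('CUST123:VID456'))  # natural key
print(make_item_name())                  # e.g. '8f6f0c...'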
31. Translate RDBMS Concepts to Key-Value Store Concepts
No clock operations, PL/SQL, Triggers
BEST PRACTICE
Do without
No constraints. Specifically,
No uniqueness constraints
No foreign key or referential constraints
No integrity constraints
BEST PRACTICE
Read Repair and Anti-entropy processes using Conditional Put/Delete
@r39132 - #netflixcloud 31
32.
33. Work around issues specific to the chosen KV store
Missing / Strange Functionality
No back-up and recovery
No native support for types (e.g. Number, Float, Date, etc…)
You cannot update one attribute and null out another one for an item in a single API call
Mis-cased or misspelled attribute names in operations fail silently. Why is SimpleDB case-sensitive?
Neglecting "limit N" returns a subset of information. Why does the absence of an optional parameter not return all of the data?
Users need to deal with data set partitioning
Beware of Nulls
Poor Performance
@r39132 - #netflixcloud 33
34. Work around issues specific to the chosen KV store
No Native Types – Sorting, Inequality Conditions, etc…
Since sorting is lexicographical, if you plan on sorting by certain attributes, then (see the sketch below):
Zero-pad logically-numeric attributes
e.g. –
000000000000000111111  (this is bigger)
000000000000000011111
Use Joda time to store logical dates
e.g. –
2010-02-10T01:15:32.864Z  (this is more recent)
2010-02-10T01:14:42.864Z
@r39132 - #netflixcloud 34
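A sketch of both encodings, so that lexicographic order matches numeric and chronological order (zero-pad width is illustrative; any fixed-width ISO-8601 UTC formatter works in place of Joda time):

from datetime import datetime, timezone

def encode_number(n, width=21):
    return str(n).zfill(width)  # '111111' -> '000000000000000111111'

def encode_timestamp(dt):
    # Fixed-width UTC timestamp, e.g. '2010-02-10T01:15:32.864Z'
    return dt.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z'

assert encode_number(111111) > encode_number(11111)  # string compare == numeric compare
assert '2010-02-10T01:15:32.864Z' > '2010-02-10T01:14:42.864Z'
print(encode_timestamp(datetime.now(timezone.utc)))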
35. Work around issues specific to the chosen KV store
Anti-pattern: Avoid Select SOME_FIELD_1 from MY_DOMAIN where SOME_FIELD_2 is null, as this is a full domain scan
Nulls are not indexed in a sparse table
BEST PRACTICE
Instead, replace this check with an (indexed) flag column called IS_FIELD_2_NULL: Select SOME_FIELD_1 from MY_DOMAIN where IS_FIELD_2_NULL = 'Y' (sketch below)
Anti-pattern: When selecting data from a domain and sorting by an attribute, items missing that attribute will not be returned
In Oracle, rows with null columns are still returned
BEST PRACTICE
Use a flag column as shown previously
@r39132 - #netflixcloud 35
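A sketch of the flag-column pattern: set the indexed flag on every write, then query the flag instead of "is null" (field names from the slide; the write-side helper is hypothetical):

def to_simpledb_attributes(record):
    attrs = {'SOME_FIELD_1': record['field_1']}
    if record.get('field_2') is None:
        attrs['IS_FIELD_2_NULL'] = 'Y'   # indexed stand-in for the missing attribute
    else:
        attrs['SOME_FIELD_2'] = record['field_2']
        attrs['IS_FIELD_2_NULL'] = 'N'
    return attrs

# The select now hits an indexed attribute instead of scanning the domain:
query = "select SOME_FIELD_1 from MY_DOMAIN where IS_FIELD_2_NULL = 'Y'"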
36. Work around issues specific to the chosen KV store
BEST PRACTICE: Aim for high index selectivity when you formulate your select expressions, for best performance (sketch below)
SimpleDB select performance is sensitive to index selectivity
Index Selectivity
Definition: # of distinct attribute values in the specified attribute / # of items in the domain
e.g. Good Index Selectivity (i.e. 1 is the best)
If a table has 100 records and one of its indexed columns has 88 distinct values, then the selectivity of this index is 88 / 100 = 0.88
e.g. Bad Index Selectivity
If an index on a table of 1000 records has only 5 distinct values, then the index's selectivity is 5 / 1000 = 0.005
@r39132 - #netflixcloud 36
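The selectivity definition above, as a runnable sketch you could apply to a sample of attribute values pulled from a domain:

def index_selectivity(values):
    # number of distinct attribute values / number of items (1.0 is the best)
    values = list(values)
    return len(set(values)) / len(values)

print(index_selectivity(['a%d' % i for i in range(88)] + ['a0'] * 12))  # 0.88
print(index_selectivity(['v%d' % (i % 5) for i in range(1000)]))        # 0.005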
37. Work around issues specific to the chosen KV store
Sharding Domains
There are 2 reasons to shard domains:
You are trying to avoid running into one of the sizing limits (e.g. 10GB of space or 1 billion attributes)
You are trying to scale your writes (see the sketch below)
To scale your writes further, use BatchPutAttributes and BatchDeleteAttributes where possible
@r39132 - #netflixcloud 37
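A sketch of one simple sharding scheme: hash the item name and spread writes across N physical domains (the editor's notes mention combining forwarding tables with modulo addressing; this shows only the modulo half, and the names are illustrative):

import zlib

NUM_SHARDS = 8  # illustrative; sized so each shard stays under the domain limits

def shard_domain(item_name, base='QUEUE'):
    shard = zlib.crc32(item_name.encode('utf-8')) % NUM_SHARDS
    return '%s_%d' % (base, shard)  # e.g. 'QUEUE_3'

print(shard_domain('CUST123:VID456'))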
38.
39. Create a Bi-directional DC-Cloud Data Replication Pipeline
Home-grown data replication framework known as IR, for Item Replication
2 schemes in use currently:
Polls the main table (a.k.a. Simple IR)
Doesn't capture deletes, but easy to implement
Polls a journal table that is populated via a trigger on the main table (a.k.a. Trigger-journaled IR)
Captures every CRUD, but requires the development of triggers
@r39132 - #netflixcloud 39
41. Create a Bi-directional DC-Cloud Data Replication Pipeline
How often do we poll Oracle?
Every 5 seconds
What does the poll query look like? (a polling-loop sketch follows)
select *
from QLOG_0
where LAST_UPDATE_TS > :CHECKPOINT      -- get recent changes
  and LAST_UPDATE_TS < :NOW_MINUS_30s   -- exclude the most recent
order by LAST_UPDATE_TS                 -- process in order
@r39132 - #netflixcloud 41
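A sketch of the polling loop around that query: each pass picks up rows committed after the checkpoint but older than now-minus-30s, replicates them in order, and advances the checkpoint (function names hypothetical):

import time
from datetime import datetime, timedelta

POLL_INTERVAL_S = 5

def poll_forever(fetch_rows, replicate, checkpoint):
    while True:
        now_minus_30s = datetime.utcnow() - timedelta(seconds=30)
        # fetch_rows runs the QLOG_0 query with the two bind variables
        for row in fetch_rows(checkpoint, now_minus_30s):
            replicate(row)                      # push the change to SimpleDB
            checkpoint = row['LAST_UPDATE_TS']  # rows arrive ordered by LAST_UPDATE_TS
        time.sleep(POLL_INTERVAL_S)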
42. Create a Bi-directional DC-Cloud Data Replication Pipeline
Data Replication Challenges & Best Practices
SimpleDB throttles traffic aggressively via 503 HTTP response codes ("Service Unavailable")
With singleton writes, I see 70-120 write TPS/domain
IR:
Shards domains (i.e. partitions data sets) to work around these limits
Employs a slow ramp-up
Uses BatchPutAttributes instead of (singleton) PutAttributes calls
Exercises a bounded exponential back-off algorithm (see the sketch below)
Uses attribute-level replace=false when fork-lifting data
@r39132 - #netflixcloud 42
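A sketch of the bounded exponential back-off around batch puts (boto 2 call shown; the retry bounds are illustrative):

import time
import boto
from boto.exception import SDBResponseError

conn = boto.connect_sdb()

def batch_put_with_backoff(domain_name, items, max_retries=8,
                           base_delay=0.1, max_delay=10.0):
    delay = base_delay
    for _ in range(max_retries):
        try:
            # items: {item_name: {attr: value, ...}, ...}
            return conn.batch_put_attributes(domain_name, items)
        except SDBResponseError as e:
            if e.status != 503:  # only back off on throttling
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential, but bounded
    raise RuntimeError('still throttled after %d attempts' % max_retries)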
44. Create a Bi-directional DC-Cloud Data Replication Pipeline
Data Replication Challenges & Best Practices
Implementing multi-mastering and an eventually-consistent replication pipeline
SimpleDB offers optimistic concurrency control in the form of conditional puts (and deletes)
For our data, it is OK to be "consistent, but not accurate"
With this relaxation, we do not need to be concerned with synchronizing logical clocks
We simply need to ensure that each conditional put writes a large, strictly increasing value into the "version" column (see the sketch below)
@r39132 - #netflixcloud 44
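A sketch of that scheme: each replicated write stamps a zero-padded millisecond timestamp into "version" and uses a conditional put, so the highest version wins regardless of which master wrote first (attribute and helper names are illustrative):

import time
import boto

conn = boto.connect_sdb()

def replicate_item(domain_name, item_name, attrs, current_version):
    # Large, strictly increasing, and lexicographically sortable version value
    new_version = str(int(time.time() * 1000)).zfill(20)
    attrs = dict(attrs, version=new_version)
    # Succeeds only if the stored version is still the one we read;
    # a loser re-reads and retries, so replicas converge
    return conn.put_attributes(domain_name, item_name, attrs,
                               replace=True,
                               expected_value=['version', current_version])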
Editor's Notes
Existing functionality needs to move in phases
Limits the risk and exposure to bugs
Limits conflicts with new product launches
Dynamo storage doesn’t suffer from this!
This is an issue with any SQL-like query layer over a sparse data model. It can happen in other technologies.
You cannot treat SimpleDB like a black box for performance-critical applications.
We found that write availability was affected by the partitioning scheme. We use a combination of forwarding tables and modulo addressing.
Mention trickle lift
We like that it is available, hosted, and managed.
We don’t like the performance issues
We are looking into Cassandra and other KV stores