Presentation about facebook's chat architecture which explains how erlang was used to build a high scaling chat. This is an extract from - http://www.erlang-factory.com/upload/presentations/31/EugeneLetuchy-ErlangatFacebook.pdf
Note that facebook moved out of erlang as it was difficult to find the programmers and a few other small problems that forced to reimplement this with C++ on thrift.
A preview into SQL Server 2019 from Bob, Asad and my presentation at PASS Summit 2018 (Nov '18). We provided insights into what our public preview builds for SQL Server 2019 had in November.
Applying DevOps to Databricks can be a daunting task. In this talk this will be broken down into bite size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, will will focus on the Databricks Rest API (using Python) to perform our tasks.
Big Query - Utilizing Google Data Warehouse for Media Analyticshafeeznazri
This topic will cover the intermediate understanding of Google Big Query and how Media Prima Digital utilizing Big Query as data warehouse for production.
Aligner vos données avec Wikidata grâce à l'outil Open RefineGautier Poupeau
Tutoriel sous la forme d'un pas à pas pour aligner des données avec Wikidata grâce à l'outil Open Refine. Dans ce tutoriel, les données alignées proviennent de la plateforme HAL récupérées via le Sparql endpoint.
Presentation about facebook's chat architecture which explains how erlang was used to build a high scaling chat. This is an extract from - http://www.erlang-factory.com/upload/presentations/31/EugeneLetuchy-ErlangatFacebook.pdf
Note that facebook moved out of erlang as it was difficult to find the programmers and a few other small problems that forced to reimplement this with C++ on thrift.
A preview into SQL Server 2019 from Bob, Asad and my presentation at PASS Summit 2018 (Nov '18). We provided insights into what our public preview builds for SQL Server 2019 had in November.
Applying DevOps to Databricks can be a daunting task. In this talk this will be broken down into bite size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, will will focus on the Databricks Rest API (using Python) to perform our tasks.
Big Query - Utilizing Google Data Warehouse for Media Analyticshafeeznazri
This topic will cover the intermediate understanding of Google Big Query and how Media Prima Digital utilizing Big Query as data warehouse for production.
Aligner vos données avec Wikidata grâce à l'outil Open RefineGautier Poupeau
Tutoriel sous la forme d'un pas à pas pour aligner des données avec Wikidata grâce à l'outil Open Refine. Dans ce tutoriel, les données alignées proviennent de la plateforme HAL récupérées via le Sparql endpoint.
Cloudera - The Modern Platform for AnalyticsCloudera, Inc.
This presentation provides an overview of Cloudera and how a modern platform for Machine Learning and Analytics better enables a data-driven enterprise.
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases has been a long and open discussion since the dawn of MapReduce over more than a decade ago. At Facebook, we have spent the past several years in independently building and scaling both Presto and Spark to Facebook scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-art low-latency evaluation with Spark’s robust and fault tolerant execution engine.
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
(Celia Kung, LinkedIn) Kafka Summit SF 2018
For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address such issues, we have developed a new mirroring solution, built on top our stream ingestion service, Brooklin. Brooklin MirrorMaker aims to provide improved performance and stability, while facilitating better management through finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin MirrorMaker and our plans for iterating further on this new mirroring solution.
PostgreSQL 15 and its Major Features -(Aakash M - Mydbops) - Mydbops Opensour...Mydbops
PostgreSQL 15 was released on October 13. This presentation on the major features and improvements in PostgreSQL 15, and how they can help for better performance and brings more reliability.
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Nubank embraces company-wide data usage as an important competitive advantage.
In this presentation I give an overview of our current data infrastructure, how we create ETL jobs, what data tools we use and which internal services are provided by the data team.
This was originally presented on the São Paulo Machine Learning Meetup, December 2018; the deck is in english
Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decision, adjust to the market, meet needs of their customers or run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time Asurion’s technical team will share battle tested tips and tricks you only get with certain scale. Asurion data lake executes 4000+ streaming jobs and hosts over 4000 tables in production Data Lake on AWS.
Architecting Agile Data Applications for ScaleDatabricks
Data analytics and reporting platforms historically have been rigid, monolithic, hard to change and have limited ability to scale up or scale down. I can’t tell you how many times I have heard a business user ask for something as simple as an additional column in a report and IT says it will take 6 months to add that column because it doesn’t exist in the datawarehouse. As a former DBA, I can tell you the countless hours I have spent “tuning” SQL queries to hit pre-established SLAs. This talk will talk about how to architect modern data and analytics platforms in the cloud to support agility and scalability. We will include topics like end to end data pipeline flow, data mesh and data catalogs, live data and streaming, performing advanced analytics, applying agile software development practices like CI/CD and testability to data applications and finally taking advantage of the cloud for infinite scalability both up and down.
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating Shuffle in Join or group-by-aggregate scenario. This is ideal for a variety of write-once and read-many datasets at Bytedance.
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
Technical deep dive for database system developers in the Arrow columnar format, binary protocol, C++ development platform, and Arrow Flight RPC.
See demo Jupyter notebooks at https://github.com/wesm/vldb-2019-apache-arrow-workshop
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseDataWorks Summit
In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world so that nearly every streaming framework now supports higher level relational operations.
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next generation ETL data pipeline in near real time. What does this look like in an enterprise production environment to deploy and operationalized?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story?
We discuss the drivers and expected benefits of changing the existing event processing systems. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
Speaker: Andrew Psaltis, Principal Solution Engineer, Hortonworks
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set.
Speakers: Junjie Chen, Junping Du
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of zero loss while processing over 400 billion events daily. The session covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak.
Cloudera - The Modern Platform for AnalyticsCloudera, Inc.
This presentation provides an overview of Cloudera and how a modern platform for Machine Learning and Analytics better enables a data-driven enterprise.
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases has been a long and open discussion since the dawn of MapReduce over more than a decade ago. At Facebook, we have spent the past several years in independently building and scaling both Presto and Spark to Facebook scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-art low-latency evaluation with Spark’s robust and fault tolerant execution engine.
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
(Celia Kung, LinkedIn) Kafka Summit SF 2018
For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address such issues, we have developed a new mirroring solution, built on top our stream ingestion service, Brooklin. Brooklin MirrorMaker aims to provide improved performance and stability, while facilitating better management through finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin MirrorMaker and our plans for iterating further on this new mirroring solution.
PostgreSQL 15 and its Major Features -(Aakash M - Mydbops) - Mydbops Opensour...Mydbops
PostgreSQL 15 was released on October 13. This presentation on the major features and improvements in PostgreSQL 15, and how they can help for better performance and brings more reliability.
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Nubank embraces company-wide data usage as an important competitive advantage.
In this presentation I give an overview of our current data infrastructure, how we create ETL jobs, what data tools we use and which internal services are provided by the data team.
This was originally presented on the São Paulo Machine Learning Meetup, December 2018; the deck is in english
Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decision, adjust to the market, meet needs of their customers or run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time Asurion’s technical team will share battle tested tips and tricks you only get with certain scale. Asurion data lake executes 4000+ streaming jobs and hosts over 4000 tables in production Data Lake on AWS.
Architecting Agile Data Applications for ScaleDatabricks
Data analytics and reporting platforms historically have been rigid, monolithic, hard to change and have limited ability to scale up or scale down. I can’t tell you how many times I have heard a business user ask for something as simple as an additional column in a report and IT says it will take 6 months to add that column because it doesn’t exist in the datawarehouse. As a former DBA, I can tell you the countless hours I have spent “tuning” SQL queries to hit pre-established SLAs. This talk will talk about how to architect modern data and analytics platforms in the cloud to support agility and scalability. We will include topics like end to end data pipeline flow, data mesh and data catalogs, live data and streaming, performing advanced analytics, applying agile software development practices like CI/CD and testability to data applications and finally taking advantage of the cloud for infinite scalability both up and down.
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating Shuffle in Join or group-by-aggregate scenario. This is ideal for a variety of write-once and read-many datasets at Bytedance.
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
Technical deep dive for database system developers in the Arrow columnar format, binary protocol, C++ development platform, and Arrow Flight RPC.
See demo Jupyter notebooks at https://github.com/wesm/vldb-2019-apache-arrow-workshop
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseDataWorks Summit
In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world so that nearly every streaming framework now supports higher level relational operations.
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next generation ETL data pipeline in near real time. What does this look like in an enterprise production environment to deploy and operationalized?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story?
We discuss the drivers and expected benefits of changing the existing event processing systems. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
Speaker: Andrew Psaltis, Principal Solution Engineer, Hortonworks
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set.
Speakers: Junjie Chen, Junping Du
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of zero loss while processing over 400 billion events daily. The session covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old fashioned Hive as a tool for easily and efficiently converting exiting datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
What happens when you transform your threat hunt playbooks from static step-by-step guides to something more dynamic? What if instead of copying and pasting code and queries from a document you could execute blocks of code from within the same framework as your text and notes? Notebook technologies have emerged largely from the data science community and have a direct application to the security domain.
We will show data science examples applied to threat hunting that involve interfacing with data from across the data landscape … one notebook, multiple data sources.
https://events.secureworldexpo.com/agenda/seattle-wa-2018/
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowPyData
By Sudheesh Katkam
PyData New York City 2017
Dremio is a new open source project for self-service data fabric. Dremio simplifies and accelerates access to data from any source and any size, including relational databases, NoSQL, Hadoop, Parquet, and text files. We'll show you how you can use Dremio to visually curate data from any source, then access via Pandas or Jupyter notebook for rapid access.
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can becomes as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
Supercharging Data Performance for Real-Time Data Analysis Ryft
The velocity and volume of data are growing faster than ever before, and companies are looking for new methods to speed their data analytics. Using an innovative FPGA-based architecture, the Ryft ONE supercharges data analytics and provides you more value from your data.
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
JT Kellington, IBM and Allan Cantle, Nallatech present at the 2015 HPCC Systems Engineering Summit Community Day about porting HPCC Systems to the POWER8-based ppc64el architecture.
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Uwe Korn
In the space of building products with data, either by dealing with huge amounts of data or by applying machine learning, many different ecosystems meet. Larger volumes of data have to be passed between these systems. The handling of the data is not only down to divide between systems written in Java that need to pass it on to the machine learning model in Python. When you take into account that you want to integrate with the existing business infrastructure, you also need to cater for legacy systems as well do you need to bring the large volumes of data to the user via UIs.
Similar to Ursa Labs and Apache Arrow in 2019 (20)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
Talk about building shared, language-agnostic computational infrastructure for data science. Discusses the motivation and work that's happening in the Apache Arrow project to help (http://arrow.apache.org)
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
Slides from Spark Summit East 2017 — February 9, 2017 in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools
Python Data Wrangling: Preparing for the FutureWes McKinney
Given at PyCon HK on October 29, 2016. About open source work in progress to advance the Python pandas project internals and leverage synergies with other efforts in OSS data technology
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
7. Much of the data science stack’s computational
foundation is severely dated, rooted in 1980s /
1990s FORTRAN-style semantics
Single-core /
single-threaded
algorithms
Naïve execution
model, eager
evaluation
Primitive memory
management,
expensive data access
Fragmented language
ecosystems,
“Proprietary” memory
models …
8. We can do so much better through modern
systems techniques
Multi-core algorithms,
GPU acceleration,
Code generation
(LLVM)
Lazy evaluation,
“query” optimization
Sophisticated memory
management,
Efficient access to huge
data sets
Interoperable memory
models, zero-copy
interchange between
system components
Note 1
Moore’s Law (and small
data) enabled us to get by
for a long time without
confronting some of these
challenges
Note 2
Most of these methods
have already been widely
employed in analytic
databases. Limited
“novel” research needed
9. • Open source project founded in 2016 by key developers of 13
major open source data projects
• Key ideas
• Language agnostic, open standard in-memory format for
columnar data (aka “data frames”)
• Bring together database and data science communities to
collaborate on shared computational technology
• 3 years old, over 200 unique contributors, > 1 million monthly
installs
10. Defragmenting Data Access
Up to 80-90% of CPU
cycles spent on
de/serialization
Life without Arrow Life with Arrow
No de/serialization
11. The Arrow Development Platform
• Open source library stack offering some level of support for 11
different programming languages
• Focus
• Reuse of runtime data and algorithms without copying or
serialization
• Fast data access (storage systems, file formats)
• Efficient data interchange (IPC, RPC)
• Accelerated In-memory computing
• Foundation of new systems, while accelerating existing ones
12. Worse Patterns Better Patterns
Custom data structures
Copy and convert
Custom algorithms
Custom file formats
Custom wire protocols
(Open)
Standard data structures
Zero copy
Standard algorithms
Standard file formats
Standard wire protocols
13. Analytic database architecture
Front end API
Computation Engine
In-memory storage
IO and
Deserialization
● Vertically integrated /
“Black Box”
● Internal components do
not have a public API
● Users interact with front
end
14. Analytic database, deconstructed
Front end API
Computation Engine
In-memory storage
IO and
Deserialization
● Components have public
APIs
● Use what you need
● Different front ends can
be developed
16. Arrow Use Cases
● Data access
○ Read and write widely used storage formats
○ Interact with database protocols, other data sources
● Data movement
○ Zero-copy interprocess communication
○ Efficient RPC / client-server communications
● Computation libraries
○ Efficient in-memory / out-of-core data frame-type analytics
○ LLVM-compilation for vectorized expression evaluation
17. Some problems relevant to pandas users
• Memory-mapping large on-disk datasets
• Efficient string processing
• Chunked / non-contiguous tables
• Native nested types (structs, arrays, unions)
• Efficient interchange with other systems
18. Example: Arrow-accelerated Python + Apache Spark
● Joint work with Li Jin from Two
Sigma, Bryan Cutler from IBM
● Vectorized user-defined functions,
fast data export to pandas
import pandas as pd
from scipy import stats
@pandas_udf('double')
def cdf(v):
return pd.Series(stats.norm.cdf(v))
df.withColumn('cumulative_probability',
cdf(df.v))
19. Example: Arrow-accelerated Python + Apache Spark
Spark SQL
Arrow Columnar
Stream Input
PySpark Worker
Zero copy via socket pandas
Arrow Columnar
Stream Output
to arrow
from arrow
from arrow
to arrow
22. 2019 Ursa Labs Development Agenda
● File format ingest/export
● Arrow RPC: “Flight” Framework
● Gandiva: LLVM-based expression compiler
● In-memory Columnar Query Engine
● Language interop: Python and R
● Cloud file-system support
23. 2018 Accomplishments
• 3 major releases (0.9, 0.10, 0.11)
• 1600+ resolved JIRA issues
• 7 codebase donations
• Major engineering efforts
• Improved CI / CD tooling; packaging automation for releases
• Bootstrap R library development
• C++ CSV reader
• Combine Arrow and Parquet C++ codebases
• GPU support library
25. Arrow Flight RPC Framework
• Key idea: standardized high performance data transport
• A gRPC-based framework for defining custom data services that
send and receive Arrow columnar data natively
• Uses Protocol Buffers v3 for client protocol
• Pluggable command execution layer, authentication
• Low-level gRPC optimizations (~ 10x faster than comparables)
• Write Arrow memory directly onto outgoing gRPC buffer
• Avoid any copying or deserialization
26. Arrow Flight - Efficient gRPC transport
Client
DoGet
Data Node
FlightData
Row
Batch
Row
Batch
Row
Batch
Row
Batch
Row
Batch
...
Data transported in a Protocol
Buffer, but reads can be made
zero-copy by writing a custom
gRPC “deserializer”
27. Gandiva, LLVM-powered expression compiler
• Initially developed by Dremio, donated to Apache
Arrow
• Efficient evaluation of projections, filters, and
aggregates
• Uses LLVM for runtime code generation
• Dremio using to accelerate a Java-based distributed
SQL engine
28. Using Gandiva from Java with zero-copy
SELECT year(timestamp), month(timestamp), …
FROM table
...
Input Table
Fragment
Arrow Java
JNI (Zero-copy)
Evaluate
Gandiva
LLVM
Function
Arrow C++
Result Table
Fragment
29. Cloud Service Support
● Support data engineering workflows on AWS, GCP,
Azure
● Optimized IO for cloud blob storage (S3, GCS, etc.)
● In C++, so can be used in Python, R, Ruby, etc.
30. Looking ahead
• 2019 likely to be year of rapid growth for Apache
Arrow
• Grow community, diversity of language and scope
• Join us: https://github.com/apache/arrow