Spark on Hadoop is highly scalable. Cloud computing is highly scalable. R, the extensible open-source data science software, is not. But what happens when we combine Spark on Hadoop, cloud computing, and Microsoft R Server into one scalable data science platform? Imagine being able to explore, transform, and model data of any size from your favorite R environment. Now imagine deploying the resulting models, with just a few clicks, as a scalable, cloud-based web service API. In this session, Sascha Dittmann shows how you can use your R code, thousands of open-source R packages, and distributed implementations of the most popular machine learning algorithms to do exactly that. He demonstrates how to create an HDInsight Spark cluster including a Microsoft R Server cluster, and how to deploy the resulting model in SQL Server or as a Swagger-based API for application developers.
Optimizing Delta/Parquet Data Lakes for Apache Spark - Databricks
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta.
* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren't mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes
Speaker: Matthew Powers
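A minimal PySpark sketch of a few of the patterns above (paths, column names, partition counts, and the Delta OPTIMIZE example are illustrative assumptions, not taken from the talk):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-optimization").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical input path

# File compaction: rewrite many small files into fewer, larger ones.
(df.repartition(64)
   .write.mode("overwrite")
   .parquet("s3a://my-bucket/events_compacted/"))

# Partitioned layout: partitionBy lets readers prune whole directories.
(df.write.partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3a://my-bucket/events_by_date/"))

# Partition pruning: a filter on the partition column skips directories.
recent = (spark.read.parquet("s3a://my-bucket/events_by_date/")
          .where("event_date = '2020-01-01'"))

# Predicate pushdown: filters on ordinary columns are pushed into the Parquet
# reader, which skips non-matching row groups using footer statistics.
recent.where("amount > 100").show()

# On Databricks, a Delta table additionally supports compaction and data
# skipping via OPTIMIZE ... ZORDER BY (Databricks-specific SQL):
# spark.sql("OPTIMIZE delta.`/mnt/events` ZORDER BY (user_id)")
```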
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shakirzyanov) - DataStax
DataStax provides modern, feature-rich, and highly tunable client libraries for C/C++, C#, Java, Node.js, Python, PHP, and Ruby that work with any cluster size, whether deployed across multiple on-premises or cloud data centers.
Come learn right from the source about the DataStax drivers for Apache Cassandra and DSE and how they can help you build continuously available, fault-tolerant, and instantly responsive applications.
About the Speakers
Alex Popescu Senior Product Manager, DataStax
I'm a developer turned product manager building developer tools for Apache Cassandra and DSE. With an eye for simplicity, I focus on creating friendly developer solutions that enable building high-performance, scalable, and fault-tolerant applications. I'm passionate about open source, and over the years I have made numerous contributions to major projects like TestNG and Groovy.
Bulat Shakirzyanov Architect, DataStax
Bulat Shakirzyanov, a.k.a. avalance123, is a software alchemist who holds a black belt in test-fu. Open source enthusiast, author of and contributor to several popular open source projects, he also loves talking about clean code, open source, unix, distributed systems, consensus algorithms and himself in third person.
Making Apache Spark Better with Delta Lake - Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
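A hedged sketch of the conversion step, using the open-source Delta Lake Python API (the path and the Spark session settings are assumptions; exact configuration varies by Delta and Spark version):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("delta-migration")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# In-place conversion of an existing Parquet directory into a Delta table.
DeltaTable.convertToDelta(spark, "parquet.`/data/events`")

# From here on, reads and writes go through the Delta format and are ACID.
df = spark.read.format("delta").load("/data/events")
df.write.format("delta").mode("append").save("/data/events")
```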
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian) - DataStax
Cassandra is moving away from Thrift to the CQL protocol. Along with this change, Thrift-based client drivers are no longer actively supported, nor are they exposed to new Cassandra features. This puts many existing Cassandra users in a situation where they need to migrate their Thrift-based applications to CQL-based ones. A complete Thrift-to-CQL migration requires changes in three areas: application code, data model, and existing data. In this session we will focus on how to migrate existing data effectively in order to reflect data model changes, and give you the tools you need to apply what you have learned in your own environments.
Learning Objectives:
1) Gain insight into what the Cassandra storage engine looks like in version 2.2 and earlier
2) Get a better idea of how the differences between Thrift and CQL affect table design
3) Explore an approach to effectively migrate dynamically generated (Thrift) data into a statically defined (CQL) table (a minimal sketch of the idea follows)
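The sketch below illustrates objective 3 under stated assumptions: it is not the speaker's tooling, just one common way to map Thrift-era dynamic columns onto a statically defined CQL table using the Python cassandra-driver (keyspace, table, and column names are invented for illustration):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # hypothetical keyspace

# A Thrift dynamic column family (arbitrary column names per row) maps
# naturally onto a CQL table whose clustering column holds the old column name.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_events (
        user_id text,
        event_name text,   -- was a dynamic Thrift column name
        event_value text,  -- was that column's value
        PRIMARY KEY (user_id, event_name)
    )
""")

insert = session.prepare(
    "INSERT INTO user_events (user_id, event_name, event_value) VALUES (?, ?, ?)")

# Each (row_key, column_name, value) triple from the Thrift dump becomes
# one CQL row; the sample triples here are placeholders.
for row_key, column, value in [("u1", "login", "2016-01-01"),
                               ("u1", "click", "home")]:
    session.execute(insert, (row_key, column, value))
```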
About the Speaker
Yabin Meng Apache Cassandra / DataStax Enterprise Consultant, Pythian
Yabin is a DataStax certified Architect, Administrator, and Developer. He has been in the IT industry for more than 15 years, and much of his career has revolved around database technologies. He has been working with Cassandra for about 2 years and is currently a Cassandra/DSE consultant at Pythian.
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL) - Ontico
Database sharding involves spreading database contents across multiple servers, with each server holding only part of the database. While it is possible to vertically scale Postgres, and to scale read-only workloads across multiple servers, only sharding allows multi-server read-write scaling. This presentation will cover the advantages of sharding and future Postgres sharding implementation requirements, including foreign data wrapper enhancements, parallelism, and global snapshot and transaction control. This is a follow-up to my Postgres Scaling Opportunities presentation.
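The talk looks forward, but the foreign-data-wrapper building block it names already exists today; here is a hedged sketch of hash-sharding with postgres_fdw plus declarative partitioning (PostgreSQL 11+ syntax; hosts, credentials, and table names are assumptions):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

# postgres_fdw lets the coordinator treat a remote server as a data source.
cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
cur.execute("""
    CREATE SERVER IF NOT EXISTS shard1
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.example.com', dbname 'app')
""")
cur.execute("""
    CREATE USER MAPPING IF NOT EXISTS FOR postgres SERVER shard1
        OPTIONS (user 'postgres', password 'secret')
""")

# The parent table is hash-partitioned; one partition lives on the remote
# shard, so writes and reads to that slice are routed over the FDW.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (id bigint, payload text)
        PARTITION BY HASH (id)
""")
cur.execute("""
    CREATE FOREIGN TABLE IF NOT EXISTS events_r0
        PARTITION OF events FOR VALUES WITH (MODULUS 2, REMAINDER 0)
        SERVER shard1
""")
```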
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Corp.) - DataStax
Element Fleet has the largest benchmark database in our industry and we needed a robust and linearly scalable platform to turn this data into actionable insights for our customers. The platform needed to support advanced analytics, streaming data sets, and traditional business intelligence use cases.
In this presentation, we will discuss how we built a single, unified platform for both advanced analytics and traditional business intelligence using Cassandra on DSE. With Cassandra as our foundation, we are able to plug in the appropriate technology to meet varied use cases. The platform we’ve built supports real-time streaming (Spark Streaming/Kafka), batch and streaming analytics (PySpark, Spark Streaming), and traditional BI/data warehousing (C*/FiloDB). In this talk, we are going to explore the entire tech stack and the challenges we faced trying to support the above use cases. We will specifically discuss how we ingest and analyze IoT (vehicle telematics) data in real-time and batch, combine data from multiple data sources into a single data model, and support standardized and ad-hoc reporting requirements.
About the Speaker
Jim Peregord Vice President - Analytics, Business Intelligence, Data Management, Element Corp.
Sherlock: an anomaly detection service on top of Druid - DataWorks Summit
Sherlock is an anomaly detection service built on top of Druid. It leverages EGADS (Extensible Generic Anomaly Detection System; github.com/yahoo/egads) to detect anomalies in time-series data. Users can schedule jobs on an hourly, daily, weekly, or monthly basis, view anomaly reports from Sherlock's interface, or receive them via email.
Sherlock has four major components: time-series generation, EGADS anomaly detection, the Redis backend, and the Spark Java UI. Time-series generation involves building, validating, and executing the Druid query and parsing the response. The parsed Druid response is then fed to the EGADS anomaly detection component, which detects anomalies and generates reports for each input time series. Sherlock uses the Redis backend to store job metadata, generated anomaly reports, a persistent job queue for scheduling jobs, etc.; users can choose between clustered and standalone Redis. Sherlock's user interface is built with Spark Java. The UI enables users to submit instant anomaly analyses, create and launch detection jobs, and view anomalies on a heatmap and on a graph. Jigarkumar Patel, Software Development Engineer I, Oath Inc., and David Servose, Software Systems Engineer, Oath
Composable Data Processing with Apache Spark - Databricks
As the usage of Apache Spark continues to ramp up within the industry, a major challenge has been scaling our development. Too often we find that developers are re-implementing a similar set of cross-cutting concerns, sprinkled with some variance of use-case specific business logic as a concrete Spark App.
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...) - DataStax
Designing and optimizing a micro-batch processing system to handle multi-billion events using 100+ nodes of Cassandra, Spark, and Kafka - lessons learned from the trenches.
Designing and optimizing for 20+ billion operations a day presents a set of complex challenges, especially when the SLA is near real-time. In this presentation we will walk through our experience building a large-scale event processing pipeline using Cassandra, Spark Streaming, and Kafka on 100+ nodes. We will present the design patterns, development steps, and diagnostic setups, at both the technology level and the application level, that are needed to manage an application of this scale. We also aim to present some unique problems we encountered in optimizing and operationalizing these environments.
About the Speakers
Ananth Ram Senior Principal / Senior Manager, Accenture
Ananth Ram is a Solution Architect with over 17 years of experience in Oracle database architecture and designing large-scale applications. He was with Oracle Corp for nine years before joining Accenture as a Senior Principal. As part of Accenture, Ananth has been working on many large-scale Oracle and big data initiatives over the last four years.
Rich Rein Solution Architect, DataStax
Rich Rein is a Solutions Architect from DataStax on the Accenture team, with more than 30 years as an architect, manager, and consultant in Silicon Valley's computing industry.
Rumeel Kazi, Accenture Federal
Rumeel Kazi is a Senior Manager in the Accenture Health & Public Service (H&PS) practice. He has over 17 years of systems integration implementation experience involving Oracle, J2EE platforms, enterprise application integration, supply chain, ETL, and business rules management systems. Rumeel has been working on large-scale Oracle and big data application solutions for the last 5 years.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... - Sumeet Singh
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg have sprung up. Along with the Hive Metastore, these table formats try to solve problems that have long stood in the traditional data lake, with declared features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Jump Start on Apache Spark 2.2 with Databricks - Anyscale
Apache Spark 2.0 and the subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and much new functionality. Its three main themes, easier, faster, and smarter, are pervasive in its unified and simplified high-level APIs for structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
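For a taste of the unified APIs on the agenda, here is a minimal PySpark sketch (illustrative only; the tiny dataset and the commented-out streaming source are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# SparkSession is the single entry point that replaced SQLContext/HiveContext.
spark = SparkSession.builder.appName("jump-start").getOrCreate()

# DataFrames and SQL share the same Catalyst optimizer and execution engine.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.filter(col("id") > 1).show()

df.createOrReplaceTempView("letters")
spark.sql("SELECT letter FROM letters WHERE id > 1").show()

# Structured Streaming reuses the same DataFrame API over an unbounded source:
# stream = (spark.readStream.format("socket")
#           .option("host", "localhost").option("port", 9999).load())
# query = stream.writeStream.format("console").start()
```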
SF Big Analytics 2020-07-28
An anecdotal history of the data lake and various popular implementation frameworks: why certain tradeoffs were made to solve problems such as cloud storage, incremental processing, streaming and batch unification, mutable tables, ...
Spark is fast becoming a critical part of Customer Solutions on Azure. Databricks on Microsoft Azure provides a first-class experience for building and running Spark applications. The Microsoft Azure CAT team engaged with many early adopter customers helping them build their solutions on Azure Databricks.
In this session, we begin by reviewing typical workload patterns and integration with other Azure services like Azure Storage, Azure Data Lake, IoT / Event Hubs, SQL DW, Power BI, etc. Most importantly, we will share real-world tips and learnings that you can take and apply in your data engineering / data science workloads.
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System - Databricks
We describe the design and implementation of Cognitive Database, a Spark-based relational database that demonstrates novel capabilities of AI-enabled SQL queries. A key aspect of our approach is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. We seamlessly integrate the word embedding model into the existing SQL query infrastructure and use it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries.
CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies/semantic clustering, predictive queries using entities not present in a database, and, more generally, using knowledge from external sources. We demonstrate unique capabilities of Cognitive Databases using an Apache Spark 2.2.0 based prototype to execute inductive reasoning CI queries over a multi-modal relational database containing text and images from the ImageNet dataset. We illustrate key aspects of the Spark-based implementation, e.g., UDF implementations of various cognitive functions using Spark SQL, Python (via Jupyter notebook) and Scala based interfaces, Distributed Spark implementation, and integration of GPU-enabled nearest neighbor kernels.
We also discuss a variety of real-world use cases from different application domains. Further details of this system can be found in the Arxiv paper: https://arxiv.org/abs/1712.07199
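As a loose approximation of the CI-query idea, and not the authors' system, one can train word embeddings over "textified" rows with stock Spark ML and expose a similarity UDF to SQL; every name below is illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("cognitive-db-sketch").getOrCreate()

# Step 1: view structured rows as "sentences" of tokens.
docs = spark.createDataFrame(
    [(["customer:acme", "product:widget", "region:emea"],),
     (["customer:acme", "product:gadget", "region:emea"],)],
    ["tokens"])

model = Word2Vec(vectorSize=16, minCount=0,
                 inputCol="tokens", outputCol="vec").fit(docs)
vectors = {r["word"]: r["vector"] for r in model.getVectors().collect()}

# Step 2: a UDF so SQL queries can ask "how related are these two entities?"
def cognitive_sim(a, b):
    va, vb = vectors.get(a), vectors.get(b)
    if va is None or vb is None:
        return None
    return float(va.dot(vb)) / (float(va.norm(2)) * float(vb.norm(2)))

spark.udf.register("cognitive_sim", cognitive_sim, DoubleType())
spark.sql(
    "SELECT cognitive_sim('product:widget', 'product:gadget') AS sim").show()
```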
Powering Interactive BI Analytics with Presto and Delta Lake - Databricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources.
Data Privacy with Apache Spark: Defensive and Offensive Approaches - Databricks
In this talk, we’ll compare different techniques for data privacy and the protection of personally identifiable information, and their effects on statistical usefulness, re-identification risk, data schema, format preservation, and read & write performance.
We’ll cover both offensive and defensive techniques. You’ll learn what k-anonymity and quasi-identifiers are. Discover the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking, with elementary code examples for cases where no third-party products can be used. We’ll also see what approaches can be adopted to minimize the risk of data exfiltration.
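In that spirit, an elementary, hedged illustration of tokenization, generalization, and suppression in PySpark (column names and the salt are assumptions; real k-anonymity also requires checking group sizes, which is omitted here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, floor, lit, sha2

spark = SparkSession.builder.appName("privacy-sketch").getOrCreate()
df = spark.createDataFrame(
    [("alice@example.com", 34, "94105"), ("bob@example.com", 37, "94110")],
    ["email", "age", "zip"])

anonymized = (df
    # Tokenization via salted hashing: a stable join key, no raw identifier.
    .withColumn("email_token",
                sha2(concat(col("email"), lit("my-salt")), 256))
    # Generalization of quasi-identifiers (a step toward k-anonymity):
    # bucket ages into 10-year bands, truncate ZIP codes to 3 digits.
    .withColumn("age_band", floor(col("age") / 10) * 10)
    .withColumn("zip3", col("zip").substr(1, 3))
    # Suppression: drop the direct identifier entirely.
    .drop("email"))

anonymized.show(truncate=False)
```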
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) - Sudhir Mallem
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Storage formats: Parquet, ORC, RCFile, and Avro
Compression: Snappy, zlib, and default compression (gzip)
Big Data, Data Lake, Fast Data - Data Serialization Formats - Guido Schmutz
The concept of "Data Lake" is in everyone's mind today. The idea of storing all the data that accumulates in a company in a central location and making it available sounds very interesting at first. But Data Lake can quickly turn from a clear, beautiful mountain lake into a huge pond, especially if it is inexpertly entrusted with all the source data formats that are common in today's enterprises, such as XML, JSON, CSV or unstructured text data. Who, after some time, still has an overview of which data, which format and how they have developed over different versions? Anyone who wants to help themselves from the Data Lake must ask themselves the same questions over and over again: what information is provided, what data types do they have and how has the content changed over time?
Data serialization frameworks such as Apache Avro and Google Protocol Buffers (Protobuf), which enable platform-independent data modeling and data storage, can help. This talk discusses what Avro and Protobuf offer, shows how they can be used in the context of a data lake, and what advantages can be achieved. Support for Avro and Protobuf in big data and fast data platforms is also covered.
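A tiny, self-contained Avro example using the fastavro library (the file name, record schema, and default-valued field are assumptions), showing the core point: the schema travels with the data, and new fields can carry defaults for schema evolution:

```python
import fastavro

schema = fastavro.parse_schema({
    "name": "Order", "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
        # Schema evolution: a new field with a default keeps old readers working.
        {"name": "currency", "type": "string", "default": "EUR"},
    ],
})

records = [{"id": 1, "amount": 9.99, "currency": "EUR"}]

with open("orders.avro", "wb") as out:
    fastavro.writer(out, schema, records)

with open("orders.avro", "rb") as f:
    reader = fastavro.reader(f)
    print(reader.writer_schema)  # the schema is stored in the file itself
    for rec in reader:
        print(rec)
```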
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics... - Debraj GuhaThakurta
Event: TDWI Accelerate, Seattle, Oct 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Tags: R, Spark, SQL Server
TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ... - Debraj GuhaThakurta
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-DB analytics using R in Spark and SQL Server
SQL Server Machine Learning Services is an embedded, predictive analytics and data science engine that can execute R and Python code within a SQL Server database as stored procedures, as T-SQL script containing R or Python statements, or as R or Python code containing T-SQL.
The key value proposition of Machine Learning Services is the power of its proprietary packages to deliver advanced analytics at scale, and the ability to bring calculations and processing to where the data resides, eliminating the need to pull data across the network.
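A hedged sketch of the stored-procedure path from Python via pyodbc; sp_execute_external_script is the documented entry point, while the connection string, table, and script body are assumptions:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=demo;Trusted_Connection=yes;")

tsql = """
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
# InputDataSet / OutputDataSet are the default frame names MLS passes in/out.
OutputDataSet = InputDataSet.groupby("category", as_index=False).sum()
',
    @input_data_1 = N'SELECT category, amount FROM dbo.sales';
"""

# The Python script runs inside SQL Server, next to the data; only the
# aggregated result rows cross the network back to the client.
for row in conn.cursor().execute(tsql).fetchall():
    print(row)
```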
6th Session - Application domains in the search for advanced statistical techno... - Jürgen Ambrosi
In this session we will see, with the usual practical hands-on demo approach, how to use the R language to perform value-added analyses.
We will experience first-hand the parallelization performance of the algorithms, a fundamental aspect in helping the researcher reach his goals.
In this session we will be joined by Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation - Revolution Analytics
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
DataMass Summit - Machine Learning for Big Data in SQL Server - Łukasz Grala
A session showing both Machine Learning Server (machine learning algorithms in the R and Python languages) and the ability to work with JSON data in SQL Server, as well as connecting to data residing in HDFS, Hadoop, or Spark via PolyBase in SQL Server, in order to use that data for analysis and prediction through models in R or Python.
Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark - Alex Zeltov
Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark
By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open-source R, which helps you to train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes.
Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together.
R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics.
This meetup will NOT be a data science intro or an R programming intro. It is about working with data and big data on MLS.
- How to Scale R
- Work with R and Hadoop + Spark
- Demo of MLS on an HDP/HDInsight server with RStudio
- How to operationalize model deployment using the MLS web-service operationalization features, on MLS Server or on the cloud Azure ML (PaaS) offering.
Speaker Bio:
Alex Zeltov is a Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in information technology, most recently in big data and predictive analytics. He currently works as a Global Black Belt Technical Specialist at Microsoft, where he concentrates on big data and advanced analytics use cases. Prior to joining Microsoft he worked as a Sr. Solutions Engineer at Hortonworks, where he specialized in the HDP and HDF platforms.
Revolution R Enterprise - Portland R User Group, November 2013 - Revolution Analytics
Presented by David Smith and Michael Helbraun to the Portland R User Group, November 13, 2013
http://www.meetup.com/portland-r-user-group/events/147311372/
High Performance Predictive Analytics in R and Hadoop - DataWorks Summit
Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
Large Scale Machine Learning with Apache Spark - Cloudera, Inc.
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLLib, a library of machine learning algorithms for large data. The presentation will cover the state of MLLib and the details of some of the scalable algorithms it includes.
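For flavor, here is one MLlib algorithm in minimal form, using the DataFrame-based pyspark.ml API rather than the original RDD-based MLlib interface; the toy data is an assumption:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([2.2, 0.9]))],
    ["label", "features"])

# The model is fit in parallel across the cluster's partitions.
model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)
model.transform(train).select("label", "prediction").show()
```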
microsoft r server for distributed computing - BAINIDA
microsoft r server for distributed computing - กฤษฏิ์ คำตื้อ,
Technical Evangelist,
Microsoft (Thailand)
At THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE, organized by the Faculty of Applied Statistics and DATA SCIENCES THAILAND
Microsoft R enables enterprise-wide, scalable, experimental data science and operational machine learning by providing a collection of servers and tools that extend the capabilities of open-source R. In these slides, we give a quick introduction to the Microsoft R Server architecture and a comprehensive overview of ScaleR, the core libraries of Microsoft R that enable parallel execution and the use of external data frames (XDFs). A tutorial-like presentation covering how to: 1) set up the environment, 2) read data, 3) process & transform, 4) analyze, summarize, visualize, 5) learn & predict, and finally 6) deploy and consume (using mrsdeploy).
eRum2016 - RevoScaleR - Performance and Scalability R - Łukasz Grala
Conference eRum2016.
The European R users meeting (eRum) is an international conference that aims to bring together users of the R language. eRum 2016 will be a good chance to exchange experiences, broaden knowledge of R, and collaborate. One can participate in eRum 2016: (1) with a regular oral presentation, (2) with a lightning talk, (3) with a poster presentation, (4) or attending without a presentation or poster. Due to the space available at the conference venue, the organizers have set the limit of participants at 250.
Session about RevoScaleR.
Presentation given by US Chief Scientist, Mario Inchiosa, at the June 2013 Hadoop Summit in San Jose, CA.
These slides were presented at a Software Craftsmanship meetup @ EPAM Hungary on 26 January 2017.
During the talk we went through the evolution of structured data analytics in Spark. We compared the RDD, the SparkSQL (DataFrame) and the DataSet APIs. We used the very latest and greatest Spark 2.1, released on December 28, went through code samples and dove deep into Spark optimizations. The code samples can be downloaded from here: https://github.com/symat/spark-api-comparison
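The gist of the comparison in PySpark (the typed Dataset API is Scala/Java only, so only RDD vs. DataFrame is sketched here; the data is an assumption):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("api-comparison").getOrCreate()
pairs = [("a", 1), ("b", 2), ("a", 3)]

# RDD API: opaque lambdas; Spark cannot look inside them to optimize.
rdd_result = (spark.sparkContext.parallelize(pairs)
              .filter(lambda kv: kv[1] > 1)
              .map(lambda kv: (kv[0], kv[1] * 10))
              .collect())

# DataFrame API: declarative expressions that the Catalyst optimizer can
# rewrite, prune, and push down before any task runs.
df_result = (spark.createDataFrame(pairs, ["key", "value"])
             .where(col("value") > 1)
             .select("key", (col("value") * 10).alias("value"))
             .collect())

print(rdd_result, df_result)
```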
Anyone involved in big data projects today can hardly avoid Hadoop tools such as Hive, Spark, or Storm.
But what about Microsoft's own tools, namely Visual Studio, C#, and SQL? Sascha Dittmann explores this question in this session.
Using code samples and the Azure Data Lake services, he shows how an implementation is also possible with pure Microsoft tools.
Highly Scalable Relational Databases in Microsoft Azure - Sascha Dittmann
Most software projects start with a simple relational database. But what do you do when the demands on the database grow? What if the database size and the number of concurrent queries keep growing? What if individual databases grow into a data warehouse? In this session, Sascha Dittmann shows how such scenarios can be implemented with Microsoft Azure. He presents technologies such as Azure SQL Database, Azure Elastic Pools, and Azure SQL Data Warehouse in detail and gives examples as well as tips & tricks from practice.
SQL Server vs. Azure DocumentDB - A Battle between XML and JSON - Sascha Dittmann
Starting with SQL Server 2000, XML support gradually found its way into the Microsoft RDBMS world.
With Azure DocumentDB, Microsoft's second home-grown NoSQL database arrived in the Microsoft cloud, processing data in JSON format.
In this session we will compare these two technologies step by step using a practical example, from the preparatory steps through writing and finally reading the data.
In doing so, we will work out the advantages and disadvantages of each approach and point out best practices.
Anyone who wanted to build highly scalable applications in Microsoft Azure could not avoid the Azure Cloud Services. But this technology had its rough edges. The Azure Service Fabric preview announced at BUILD 2015 is meant to remedy this: it should let developers build applications for the cloud, as well as for their own data center, faster and with less friction. Sascha Dittmann has been gathering experience with the service since the end of last year and presents it in this session with several code examples.
SQL Saturday #313 Rheinland - MapReduce in Practice - Sascha Dittmann
The MapReduce programming model, published by Google a few years ago, has found its way into numerous systems. It has been implemented both as a standalone system, for example in Hadoop, Disco, or Amazon Elastic MapReduce, and as a query language within larger systems, for example in MongoDB, Greenplum DB, or Aster Data. This session presents common real-world problems and shows how they can be solved with the MapReduce framework of Microsoft HDInsight.
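For readers new to the model, here is the classic word count written as mapper/reducer functions in the Hadoop Streaming style, driven locally so the data flow is visible (a generic illustration, not code from the session):

```python
import itertools

def mapper(line):
    # Map: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce: all values for one key arrive together; aggregate them.
    return (word, sum(counts))

lines = ["Big Data is big", "MapReduce counts big data"]

# Shuffle/sort phase: group intermediate pairs by key, as the framework would.
intermediate = sorted(kv for line in lines for kv in mapper(line))
results = [reducer(word, (c for _, c in group))
           for word, group in itertools.groupby(intermediate,
                                                key=lambda kv: kv[0])]
print(results)  # [('big', 3), ('counts', 1), ('data', 2), ...]
```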
In 2010, the Hadoop developers realized that the existing MapReduce framework no longer scaled properly on very large clusters (4,000 nodes and more).
It was therefore completely reworked.
The result was YARN (Yet Another Resource Negotiator).
Besides better scalability, YARN produced further positive side effects.
YARN was released with the Hadoop 2.0 release in October 2013.
This session shows what YARN is all about, and which additional changes went into Hadoop 2.0.
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2) - Sascha Dittmann
The second part of our Microsoft Big Data session is about how big data can be made accessible via "classic" SQL, and how the new PolyBase engine makes it easy to join unstructured Hadoop data with relational data warehouse data.
In the Hadoop world, SQL access is provided by the Hive component.
Via the Microsoft Hive ODBC connector, the usual BI tools, such as PowerPivot, can use this access directly.
The PolyBase engine, finally, will become part of SQL Server 2012 Parallel Data Warehouse and allows transparent SQL access, no matter where the data resides.
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1) - Sascha Dittmann
In this session we use a practical scenario to show how concrete tasks can be solved with HDInsight in practice:
- Basics of HDInsight for Windows Server and Windows Azure
- Working with Windows Azure HDInsight
- Implementing MapReduce jobs with JavaScript and .NET code
Developer Open Space 2012 - Cloud Computing Workshop - Sascha Dittmann
Incorporating cloud computing into day-to-day development work continues to gain importance. Whether hybrid approaches or greenfield engineering, Windows Azure offers the building blocks to implement a requirement effectively. In this workshop we will "cloudify" a classic on-premises application together and introduce cloud resources step by step.
.NET Usergroup Rhein-Neckar: Big Data in the Cloud - Apache Hadoop-based Serv... - Sascha Dittmann
We live in a data age! According to estimates from 2006, the worldwide "data universe" amounted to about 0.18 zettabytes (1 ZB = 10^21 bytes, i.e., 1 billion terabytes). By 2011 this volume had increased tenfold (1.8 zettabytes). Big data and big processing are therefore becoming ever more important in many application scenarios.
Classic relational database systems, as well as statistics and visualization tools, are often unable to process such large amounts of data. Big data therefore calls for a new kind of software that runs massively parallel on up to hundreds or thousands of processors or servers.
Designing for Privacy in Amazon Web Services - KrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, the various challenges will be demonstrated: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach allowed us to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
Understanding Globus Data Transfers with NetSage - Globus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
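A hedged sketch of that pattern using the Globus Compute SDK's Executor interface (the endpoint UUID is a placeholder; an endpoint must already be deployed by an administrator):

```python
from globus_compute_sdk import Executor

def add(a, b):
    # This function executes on the remote endpoint, not the local machine.
    return a + b

# Placeholder UUID: substitute the ID of a deployed Globus Compute endpoint.
with Executor(endpoint_id="00000000-0000-0000-0000-000000000000") as gce:
    future = gce.submit(add, 40, 2)
    print(future.result())  # 42, computed remotely
```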
Accelerate Enterprise Software Engineering with Platformless - WSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Developing Distributed High-performance Computing Capabilities of an Open Sci... - Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam - takuyayamamoto1800
In these slides, we show a simulation example and the way to compile the solvers.
The Helmholtz equation can be solved with helmholtzFoam; the Helmholtz equation with uniformly dispersed bubbles can be simulated with helmholtzBubbleFoam.
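For reference, the base solver targets the homogeneous Helmholtz equation (my summary, not the slides'; the bubble variant modifies the effective coefficients and is not shown):

```latex
% Homogeneous Helmholtz equation: u is the field, k the wavenumber
\nabla^{2} u + k^{2} u = 0
```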
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... - Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data and applying computations on a different system. As part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined, on-demand data workflows capable of applying many data reduction and analysis steps to the large ESGF data archives, transferring only the resultant analysis (e.g., visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices; this talk charts the course of that journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
How to Position Your Globus Data Portal for Success: Ten Good Practices (Globus)
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Why React Native as a Strategic Advantage for Startup Innovation (ayushiqss)
Did you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework.
In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. Demand for the framework in the job market has grown accordingly, making it a valuable skill.
But what makes React Native so popular for mobile application development? Among other benefits, it offers excellent cross-platform capabilities: developers write code once and run it on both iOS and Android devices, saving time and resources, shortening development cycles, and speeding time-to-market for your app.
Consider a startup that wanted to launch its app on iOS and Android simultaneously. Using React Native, it built the app and brought it to market in a very short time, gaining an advantage over competitors by reaching a large user base that quickly generated revenue.
Strategies for Successful Data Migration Tools (varshanayak241)
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data migration tools like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
First Steps with Globus Compute Multi-User Endpoints (Globus)
In this presentation we will share our experiences getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we previously wrote an application using Globus Compute that offloads computationally expensive steps in the researchers' workflows, which they manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Among the challenges we encountered: each researcher had to set up and manage their own single-user Globus Compute endpoint, and the workloads had varying resource requirements (CPUs, memory, and wall time) between runs. We hope that the multi-user endpoint will help address these challenges, and we share an update on our progress here.
SOCRadar Research Team: Latest Activities of IntelBroker (SOCRadar)
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar's Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Advanced Flow Concepts Every Developer Should Know (Peter Caitens)
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
2. What is R?
• A statistics programming language
• A data visualization tool
• Open source
• 2.5M+ users
• Taught in most universities
• Thriving user groups worldwide
• 10,000+ free algorithms in CRAN
• Scalable to big data
• Used by new and recent graduates
• Rich application & platform integration
(Slide quadrants: Language, Platform, Community, Ecosystem)
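As a minimal illustration of the first two bullets, the base-R snippet below (our own example, not from the original deck) uses the built-in mtcars data set to fit a model and plot it:

# Statistics: fit a simple linear model of fuel economy on weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                   # coefficient estimates and fit diagnostics
# Visualization: scatter plot with the fitted regression line overlaid
plot(mpg ~ wt, data = mtcars)
abline(fit)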
3. Challenges posed by open source R
• Lack of commercial support
• Inadequate modeling performance
• Complex deployment processes
• Limited data scale
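To make the "limited data scale" point concrete: open source R holds data in memory, so sizes are easy to bound with back-of-the-envelope arithmetic. A rough sketch, using our own assumed row count:

# One numeric (double) column of 200 million rows:
n     <- 2e8                  # assumed row count, for illustration
bytes <- n * 8                # 8 bytes per double
bytes / 2^30                  # ~1.49 GiB for a single column, before lm() makes copies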
4. R from Microsoft brings
• Peace of mind
• Efficiency
• Speed and scalability
• Flexibility and agility
6. Convergence with Flexibility
• Scalable algorithms
• R: write once, deploy anywhere
• Templates & samples
• Microsoft R Server family
• R & Python interop with Azure ML (AML)
• Cortana Intelligence
11. Microsoft R platform: R + CRAN plus the Microsoft R components DistributedR, DeployR, DevelopR, ScaleR, and ConnectR.
Delivers high-performance parallel distributed analytics across individual and clustered systems:
• Clusters: Cloudera, Hortonworks, MapR, Apache Spark, IBM Platform LSF, Microsoft HPC
• Databases: SQL Server, Teradata
• Servers: Red Hat, SuSE, Windows
12.
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"

### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)

### CREATE LINUX DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = localFS)

### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()

### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)

### CREATE HDFS DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = hdfsFS)  # same data set, now read from HDFS

### ANALYTICAL PROCESSING (identical in both compute contexts) ###
### Statistical summary of the data
rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### Cross-tabulate the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear model and coefficient plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
Local parallel processing (Linux or Windows) vs. in-Hadoop: ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce. The compute-context R script sets where the model will run; the functional model R script does not need to change to run in Hadoop.
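Slide 11 also lists Apache Spark among the supported clusters. As a hedged sketch (assuming an edge node configured for Microsoft R Server on Spark; this is not shown in the original deck), the same functional script would run there by swapping only the compute context:

### SPARK COMPUTE CONTEXT (sketch; assumes a configured Spark edge node) ###
mySparkCC <- RxSpark()
rxSetComputeContext(mySparkCC)
### The functional model script is unchanged from the Hadoop run ###
rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)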
17. ScaleR functionality

Data Step
▪ Data import: delimited, fixed, SAS, SPSS, ODBC
▪ Variable creation & transformation
▪ Recode variables
▪ Factor variables
▪ Missing value handling
▪ Sort, merge, split
▪ Aggregate by category (means, sums)

Descriptive Statistics
▪ Min / max, mean, median (approx.)
▪ Quantiles (approx.)
▪ Standard deviation
▪ Variance
▪ Correlation
▪ Covariance
▪ Sum of squares (cross-product matrix for set variables)
▪ Pairwise cross tabs
▪ Risk ratio & odds ratio
▪ Cross-tabulation of data (standard tables & long form)
▪ Marginal summaries of cross tabulations

Statistical Tests
▪ Chi-square test
▪ Kendall rank correlation
▪ Fisher's exact test
▪ Student's t-test

Sampling
▪ Subsample (observations & variables)
▪ Random sampling

Predictive Models
▪ Sum of squares (cross-product matrix for set variables)
▪ Multiple linear regression
▪ Generalized linear models (GLM) with exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user-defined distributions & link functions
▪ Covariance & correlation matrices
▪ Logistic regression
▪ Classification & regression trees
▪ Predictions/scoring for models
▪ Residuals for all models

Cluster Analysis
▪ K-means

Classification
▪ Decision trees
▪ Decision forests
▪ Gradient-boosted decision trees
▪ Naïve Bayes

Variable Selection
▪ Stepwise regression

Simulation
▪ Simulation (e.g., Monte Carlo)
▪ Parallel random number generation

Combination
▪ rxDataStep
▪ rxExec
▪ PEMA-R API custom algorithms
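As a brief, hedged illustration of the "Combination" primitives above (our own sketch, assuming a local compute context and the AirlineDemoSmall.xdf file from slide 12):

rxSetComputeContext("localpar")
airXdf <- RxXdfData("AirlineDemoSmall.xdf")

# rxDataStep: chunked variable transformation on an .xdf file,
# adding a logical 'Late' column without loading all rows into memory
rxDataStep(inData = airXdf, outFile = "AirlineLate.xdf",
           transforms = list(Late = ArrDelay > 15),
           overwrite = TRUE)

# rxExec: run an arbitrary R function in parallel, one call per element
squares <- rxExec(function(i) i^2, rxElemArg(1:4))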