This deck gives a brief overview of the product: what esProc SPL is, cases that show what it is used for, why esProc works better, its main characteristics, and the principal technical solutions esProc is typically used for.
3. What is esProc SPL?
• An analytical database
• Computes and processes structured and semi-structured data
• Handles offline batch jobs and online queries
• Neither a SQL system nor a NoSQL technology
• Self-created SPL syntax that is more concise and efficient
SPL: Structured Process Language
4. What pain points does esProc SPL solve?
For data computing scenarios such as offline batch jobs and online queries/reports:
• Slow batch jobs cannot fit into the allotted time window and are strained especially on critical summary dates
• Business users grow frustrated waiting minutes for a query or report
• The more concurrent users and the longer the query time span, the more likely the database is to crash
• N-layer nested SQL or stored procedures of dozens of KB confuse even their own author three months later
• Dozens of data sources (RDB/NoSQL/File/JSON/Web/…) make cross-source mixed computation a pressing need
• With hot and cold data separated into different databases, real-time queries over the whole data set are hard to perform
• Heavy reliance on stored procedures makes the application impossible to migrate and the framework hard to adjust
• Too many intermediate tables exhaust database storage and resources, yet no one dares delete them
• Report demands in an enterprise are endless; how can personnel costs be relieved?
• ……
5. What are the counterpart technologies of esProc SPL?
Databases that use SQL syntax and are applied to OLAP scenarios:
• Common databases: MySQL, PostgreSQL, Oracle, DB2, …
• Data warehouses on Hadoop: Hive, Spark SQL, …
• Distributed data warehouses/MPP: …
• Cloud data warehouses: Snowflake, …
• All-in-one database machines: ExaData, …
esProc SPL: lower cost, higher performance
6. What does esProc SPL bring?
esProc SPL reduces development, hardware, and operation & maintenance costs by X times.
• Development cost
  SQL: lengthy nested code, difficult to write and debug; insufficient descriptive ability forces complex logic into roundabout formulations.
  SPL: simplified stepwise code, easy to write and debug; strong descriptive ability implements complex logic along natural lines of thinking.
• Hardware cost
  SQL: huge computing loads consume resources.
  SPL: low-complexity algorithms reduce resource consumption.
• O&M cost
  SQL: heavy, closed computing ability leads to a bloated framework.
  SPL: integrated, open computing ability forms a lightweight framework.
8. Case: Batch job on insurance policies at an auto insurance company
Task:
• Insurance policy table: 35 million rows; details table: 123 million rows
• Policies are associated in various ways, each of which must be handled separately
Before (Informix): historical policy matching took 6,672 seconds; 1,800 lines of code
After (esProc SPL): historical policy matching takes 1,020 seconds; 500 lines of code
Result: speed increased 6.5 times
Case details: http://c.raqsoft.com/article/1644827119694
9. Case: Batch job on loan agreements at a bank
Task:
• Historical data: 110 million rows; daily increase: 1.37 million rows
• Complex multi-table joins
Before (AIX + DB2): SQL of 48 steps and 3,300 lines; calculation time 1.5 hours
After (esProc SPL): calculation time 10 minutes; 500 lines of code
Result: speed increased 8.5 times
Case details: http://c.raqsoft.com/article/1644215913288
10. Case: Mobile banking, highly concurrent account queries
Task:
• Huge number of users and large concurrent access volumes
• Branch information changes frequently and must be associated in a timely manner
Before:
• Commercial data warehouses on Hadoop could not meet the high-concurrency requirement
• A 6-server Elasticsearch cluster could cope with the concurrency but could not associate in real time; data updates took a long time, during which the service had to be stopped
After (esProc SPL):
• A single machine copes with the same concurrent volume as the ES cluster
• Real-time association; zero waiting time for branch information updates
Result: 1 server vs 6 servers
Case details: http://c.raqsoft.com/article/1643533607375
11. Case: Calculating the number of unique loan clients at a bank
Task:
• Hundreds of labels that can be combined arbitrarily in queries
• Association, filtering and aggregation over a 20-million-row large table and even larger detail tables
• Each page involves calculating nearly 200 indexes, so 10 concurrent users trigger the concurrent calculation of more than 2,000 indexes
Before (Oracle):
• Unable to calculate in real time; query requirements had to be submitted in advance, with the calculation carried out one day earlier
After (esProc SPL):
• 10 concurrent users, 2,000 indexes in total, in under 3 seconds
• No advance preparation: any label combination can be selected instantly, with query results returned in real time
Result: pre-calculation turned into real-time calculation
Case details: http://c.raqsoft.com.cn/article/1593424083742
12. Case: An insurance company, outside-database stored procedures
esProc not only implements stored-procedure-like computations on Vertica, but also computes over different sources directly.
Problems:
• Vertica does not support stored procedures; preparing data requires complex nested SQL statements, and Java hard-coding is often needed.
• For mixed computing with MySQL, the MySQL data first has to be loaded into Vertica, which is tedious, not real-time, and bloats the database.
Architecture: the application layer (BIRT reporting) accesses an esProc computing layer that acts as outside-database stored procedures over the Vertica and MySQL data sources.
User comments:
• "The best use for us is to pass parameters to the Vertica database."
• "Each cell becomes a data array that is easy to use, compare and manipulate. It is very logical and you have made it user friendly."
13. Case: A bank, CUBE reconstruction
esProc solves the long response times caused by highly concurrent access and heavy data computation. Using esProc to keep high-frequency hot data up front and to build a computing middle layer proved the best solution, superior to expensive database products.
Architecture: data flows from the sources (core database, ODS) into the storage layer (GP database plus esProc data files), is processed by an esProc computing layer (computing nodes), and serves the application layer (credit, core, performance and risk reports through a web service and report service).

Index           Old method    New method    Improvement
Code quantity   3,500 lines   80 lines      43 times
Workload        Unknown       7 man-days    --
Performance     150-420 s     1-3 s         140 times
Maintenance     Difficult     Easy          Qualitative change
Optimization    Difficult     Easy          Qualitative change

After the introduction of esProc, concurrency increased greatly and the purchase of expensive distributed databases was avoided.
15. Why SQL is difficult to write: what is the maximum number of consecutive days a stock has risen?
SQL does not support ordered operations sufficiently and provides no direct ordered grouping; instead, four layers of nesting have to be used in a roundabout way. Such statements are not only difficult to write but also difficult to understand. Faced with complex business logic, the complexity of SQL increases sharply, making it hard to read and write. Nor is this an unusual requirement: it appears everywhere in the thousands of lines of SQL found in practice, severely reducing development and maintenance efficiency.

SELECT MAX(ContinuousDays)
FROM (SELECT COUNT(*) ContinuousDays
      FROM (SELECT SUM(UpDownTag) OVER (ORDER BY TradeDate) NoRisingDays
            FROM (SELECT TradeDate,
                         CASE WHEN Price > LAG(Price) OVER (ORDER BY TradeDate)
                              THEN 0 ELSE 1 END UpDownTag
                  FROM Stock))
      GROUP BY NoRisingDays)
16. Why can't SQL run fast: get the top 10 from 100 million rows of data

SELECT TOP 10 * FROM Orders ORDER BY Amount DESC

This query uses ORDER BY. Executed strictly according to that logic, it sorts the full data set, and performance will be poor. There is a well-known way to perform this operation without a full sort, but SQL cannot describe it; we can only rely on the database's optimization engine. In simple cases such as this statement, many databases can make the optimization, but in more complex situations the optimizer gets lost.
In the following example, getting the top N from each group, SQL cannot express the operation directly and has to resort to a roundabout subquery using a window function. Faced with this roundabout formulation, the database optimizer cannot optimize it and can only perform the full sort.

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders )
WHERE rn <= 10
17. The SPL solution

A1 =Stock.sort(TradeDate).group@i(Price<Price[-1]).max(~.len())

The computing logic of this SPL line is the same as that of the previous SQL, but SPL provides an ordered grouping operation directly, which is intuitive and concise.

A1 =file("Orders.ctx").open().cursor()
A2 =A1.groups(;top(10;Amount))          Top 10 orders overall
A3 =A1.groups(Area;top(10;Amount))      Top 10 orders of each area

SPL treats top-N as an aggregation operation that returns a set, avoiding a full sort. The syntax is the same whether applied to the whole set or to groups, so no roundabout approach is needed.
18. Why is SPL more advanced?
The difficulties of SQL stem from relational algebra; theoretical problems cannot be solved by engineering methods. Despite years of improvement, SQL still struggles to meet complex requirements. SPL is based on a completely different theoretical system, the discrete dataset. It provides richer data types and basic operations, and has far stronger expressive power.

【Analogy】Calculate 1+2+3+…+100 = ?
An ordinary person does it like this:
1+2=3
3+3=6
6+4=10
10+5=15
15+6=21
21+7=28
…
Gauss does it like this:
1+100=101
2+99=101
…
A total of fifty 101s, so 50*101 = 5050.
The clever Gauss came up with a convenient and efficient algorithm. The key here: multiplication is used!
SQL is like an arithmetic system with only addition: the code is lengthy and the calculation inefficient. SPL is equivalent to the invention of multiplication: it simplifies the writing and improves the performance.
Extended reading: SPL: a database language featuring easy writing and fast running
19. Common scenarios to beat SQL
In real business, complex SQL (and stored procedures) often runs to hundreds or thousands of lines, with large numbers of roundabout constructions needed to implement the calculation. The code becomes complex and the performance low.
1. Complex ordered computing: conversion funnel analysis of user behavior
• Calculate the user churn rate after each event (page browsing, search, adding to cart, placing an order, payment, etc.)
• A sequence of events counts only when the events complete within a specified time window and occur in a specified order. This is very hard even to implement in SQL, let alone to optimize.
2. Multi-step big data batch jobs
• Complex business requirements are difficult to implement directly in SQL; cursor reading is slow and hard to parallelize, wasting computing resources
• A stored-procedure implementation takes thousands of lines and dozens of steps; with intermediate results repeatedly buffered, the batch job cannot finish within the specified time window
3. Multi-index calculation on big data, with repeated use and multiple associations
• Hundreds of indexes are calculated in one pass, the detail data is used many times, and associations are involved; SQL must traverse the data repeatedly
• Mixed calculation of large-table association, conditional filtering, grouping, aggregation and deduplication; real-time calculation under high concurrency
20. Funnel analysis at an e-commerce company

A1 =["etype1","etype2","etype3"]
A2 =file("event.ctx").open()
A3 =A2.cursor(id,etime,etype; etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(etype) && …)
A4 =A3.group(uid).(~.sort(etime))
A5 =A4.new(~.select@1(etype==A1(1)):first,~:all).select(first)
A6 =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime,null))))
A7 =A6.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)

SPL provides order-related calculations and is more thoroughly set-oriented. The code follows natural thinking directly, which is simple and efficient. This script handles funnels with any number of steps; only the parameters need to change.
SQL lacks order-related calculations and is not completely set-oriented. It has to detour into multiple subqueries and repeated JOINs, which makes it difficult to write and understand, and its performance is very low. For space reasons only a three-step funnel is shown; each additional step requires another subquery.

with e1 as (
    select uid, 1 as step1, min(etime) as t1
    from event
    where etime >= to_date('2021-01-10') and etime < to_date('2021-01-25')
      and eventtype = 'eventtype1' and …
    group by 1),
e2 as (
    select uid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
    from event as e2
    inner join e1 on e2.uid = e1.uid
    where e2.etime >= to_date('2021-01-10') and e2.etime < to_date('2021-01-25')
      and e2.etime > t1 and e2.etime < t1 + 7
      and eventtype = 'eventtype2' and …
    group by 1),
e3 as (
    select uid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
    from event as e3
    inner join e2 on e3.uid = e2.uid
    where e3.etime >= to_date('2021-01-10') and e3.etime < to_date('2021-01-25')
      and e3.etime > t2 and e3.etime < t1 + 7
      and eventtype = 'eventtype3' and …
    group by 1)
select
    sum(step1) as step1,
    sum(step2) as step2,
    sum(step3) as step3
from e1
left join e2 on e1.uid = e2.uid
left join e3 on e2.uid = e3.uid
22. esProc technical architecture
esProc is an integrated computing engine centered on SPL (high-performance process computing). Within the computing layer, an algorithm engine and a storage engine work together and provide multi-source mixed computing and parallel computing. The engine serves the application layer above it; below it, the storage layer holds solidified data files as well as real-time data read from the production database; the whole stack runs on common hardware platforms (x86/ARM/…).
23. Simple and easy-to-use development environment
• Run, debug, run to cursor, and step execution; breakpoints can be set
• Simple syntax in line with natural thinking, simpler than other high-level development languages
• What-you-see-is-what-you-get cell results make debugging easy and intermediate results convenient to reference
• Real-time system information output, so abnormalities can be checked at any time
24. Specially designed syntax system
SPL is especially suitable for complex process operations: natural, clean step-by-step computation, with direct reference to cell names and no need to define variables.
25. Rich class libraries
Designed specifically for structured data tables:
• Grouping & loops
• Sorting & filtering
• Set operations
• Ordered sets
26. Excellent integration
esProc, developed in Java, can run independently or be seamlessly integrated into applications. The application program (report / Java / web service) reaches the computing layer of SPL scripts through the esProc JDBC/ODBC/HTTP interfaces; scripts are developed in the esProc IDE and run against the data sources (RDB, NoSQL, TXT, CSV, JSON, Hadoop). Execution is interpreted and supports hot swapping. A minimal embedding sketch follows.
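To make the JDBC embedding concrete, here is a minimal Java sketch that runs one SPL expression through the esProc JDBC driver. The driver class com.esproc.jdbc.InternalDriver, the URL jdbc:esproc:local://, and the file Orders.csv with its Area/Amount fields are assumptions drawn from typical esProc examples, not details stated in this deck.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EsprocEmbedDemo {
    public static void main(String[] args) throws Exception {
        // Load the esProc JDBC driver (class name assumed from common esProc examples).
        Class.forName("com.esproc.jdbc.InternalDriver");
        // A local, in-process connection: esProc runs embedded in the same JVM.
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement();
             // Execute one SPL expression: read a hypothetical CSV and group it.
             ResultSet rs = st.executeQuery(
                     "=T(\"Orders.csv\").groups(Area;sum(Amount):total)")) {
            while (rs.next()) {
                System.out.println(rs.getString("Area") + "\t" + rs.getDouble("total"));
            }
        }
    }
}

The result of the SPL expression comes back as an ordinary JDBC ResultSet, so reporting tools and plain Java code can consume it like any database query.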
27. Support for diverse data sources
Multiple data sources are used directly in mixed computations: computation runs directly against the sources without loading data into a database first, exploiting the strengths of each original source and enabling real-time computation. A mixed-source sketch follows.
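Building on the same JDBC embedding as above, the sketch below hints at a mixed-source computation: joining a CSV file with a JSON file in a single SPL expression, with neither loaded into a database first. The file names, the uid/name/etype fields, and the join()/json()/T() usage follow common SPL idioms and are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MixedSourceDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement();
             // Join a hypothetical CSV (uid, name) with a hypothetical JSON file
             // of event records (uid, etype, ...) in one SPL expression.
             ResultSet rs = st.executeQuery(
                     "=join(T(\"users.csv\"):u,uid; json(file(\"events.json\").read()):e,uid)"
                     + ".new(u.name:name, e.etype:etype)")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getString("etype"));
            }
        }
    }
}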
28. Flexible and efficient file storage
High performance through private data storage formats: bin files and composite tables. Data can be stored by business classification in a tree of directories; direct file storage without a database is more efficient, more flexible, and cheaper.
Bin file:
• Double increment segmentation supports any degree of parallelism
• Self-owned efficient compression technique (less space, less CPU usage, secure)
• Generic storage, allowing set data
Composite table:
• Mixed row and column storage
• Ordered storage improves compression rate and positioning performance
• Efficient intelligent indexes
• Double increment segmentation supports any degree of parallelism
• Integration of main and sub tables reduces storage and association cost
• Numbered keys achieve efficient positioning joins
A small write/read sketch follows this list.
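As a small illustration of solidifying data into the bin-file format and reading it back, the sketch below uses export@b and import@b in the way esProc materials commonly show them. The file and field names are hypothetical, and it is assumed the JDBC driver accepts a result-less expression via execute().

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BinFileDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement()) {
            // Solidify a hypothetical CSV into the private bin-file format.
            st.execute("=file(\"orders.btx\").export@b(T(\"Orders.csv\"))");
            // Read the bin file back and aggregate it.
            try (ResultSet rs = st.executeQuery(
                    "=file(\"orders.btx\").import@b().groups(Area;sum(Amount):total)")) {
                while (rs.next()) {
                    System.out.println(rs.getString("Area") + "\t" + rs.getDouble("total"));
                }
            }
        }
    }
}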
30. Replace stored procedures
Problems:
• Stored procedures are hard to edit and debug, and lack migratability.
• Compiling stored procedures requires high privileges, weakening security.
• A stored procedure shared by multiple applications couples those applications tightly.
Solution: migrate stored procedures to esProc SPL scripts ("outside-database stored procedures") called through a JDBC/RESTful interface. A call sketch follows.
• SPL is naturally suited to complex multi-step data computation.
• SPL scripts are naturally migratable.
• A script needs only read privileges on the database, so it causes no database security problems.
• Scripts of different applications are stored in different directories, so applications are not coupled.
Extended reading: http://c.raqsoft.com/article/1646637335078
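The sketch below shows how an application might invoke a saved SPL script as an outside-database stored procedure through esProc JDBC. The script name ordersByDate, its single date parameter, and the call scriptName(?) syntax are assumptions based on typical esProc JDBC examples, not details given in this deck.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class OutsideDbProcDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             // Invoke a saved SPL script (hypothetical name ordersByDate.splx)
             // the way a database stored procedure would be called.
             CallableStatement st = con.prepareCall("call ordersByDate(?)")) {
            st.setObject(1, "2021-01-10"); // hypothetical date parameter
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}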
31. Eliminate intermediate tables from databases
Problems:
• For query efficiency or simplified development, large numbers of intermediate tables accumulate in the database.
• Intermediate tables take up a lot of space, making the database redundant and bloated.
• An intermediate table used by several applications couples them tightly, and the tables become hard to manage (and hard to delete).
Solution: move intermediate tables out of the database into files ("external intermediate tables") accessed through a JDBC/RESTful interface. A reading sketch follows.
• Intermediate tables are kept in the database only to borrow its computational ability for subsequent computations; SPL can perform those computations over file storage instead.
• External intermediate tables (files) are easier to manage, and storing them in different directories avoids coupling between applications.
• External intermediate tables greatly reduce the load on the database, possibly removing the need to deploy a database at all.
Extended reading: http://c.raqsoft.com/article/1652163420524
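As a sketch of consuming such an external intermediate table, the snippet below reads a hypothetical bin file mid.btx and performs the subsequent filtering in SPL, the computation that previously justified keeping the table in the database. The custid and amount field names are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExternalMidTableDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement();
             // mid.btx stands in for an intermediate table moved out of the database;
             // the filtering that SQL would have done runs in SPL instead.
             ResultSet rs = st.executeQuery(
                     "=file(\"mid.btx\").import@b().select(amount > 1000)")) {
            while (rs.next()) {
                System.out.println(rs.getString("custid") + "\t" + rs.getDouble("amount"));
            }
        }
    }
}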
32. Handle endless report development requirements
Problems:
• Reporting/BI tools only solve the report presentation stage and can do nothing about data preparation.
• Data preparation implemented in SQL, stored procedures, or hard-coded Java is difficult to develop and maintain, and costly.
• Report development needs are objectively endless, and data preparation is the main driver of high development costs.
Solution: add a computing layer between report presentation and the data sources to handle data preparation. The report application's presentation layer (fixed reports, ad hoc analysis) sits on a report computing layer powered by the esProc computing engine, over RDB, NoSQL and file sources.
• SPL simplifies the data preparation of reports, makes up for the weak computing ability of reporting tools, and comprehensively improves report development efficiency.
• With both report presentation and data preparation responding quickly, endless report development needs can be met at low cost.
Extended reading: http://c.raqsoft.com/article/1642764157224
33. Mixed computation to implement real-time HTAP
HTAP databases struggle to meet real HTAP requirements:
• They require replacing the production system's database, which carries high risk
• Their SQL computing power is insufficient, historical data preparation is inadequate, and performance is low
• Their computing capacity is closed: diverse external data sources require complex ETL processes, so real-time ability is poor
esProc supports low-risk, high-performance, strongly real-time HTAP with open multi-source mixed computing:
• Mixed calculation over hot and cold data implements T+0 real-time analysis
• Organized historical cold data (from RDB, NoSQL, message queues, IoTDB, …) is combined with transaction hot data read in real time from the production database
• The production system needs no modification
• The advantages of the original multiple data sources are fully exploited
Extended reading: http://c.raqsoft.com.cn/article/1658456594393
34. Perform computation on files to implement a Lakehouse
An RDB can only be the House, not the Lake:
• Strong constraints mean non-compliant data cannot enter, and the complex ETL processes are inefficient
• It is closed: external data cannot be computed, let alone mixed in real time
The esProc approach:
• Computes open-format file data directly: txt/csv/xls/json/xml
• High-performance private-format file storage and computation
• Data can enter as-is and be sorted out gradually: Lake to House
• Rich data source interfaces allow direct real-time computing
Architecture: diverse sources (DB, NoSQL, IoT, Web, …) land in the data lake as unified data storage; esProc SPL computes directly over the lake and over esProc's high-performance storage, serving data services alongside other engines.
Extended reading: http://c.raqsoft.com.cn/article/1659057410089
35. Summary of esProc advantages
Five advantages:
• High performance: big data processing speed one order of magnitude higher than traditional solutions
• Efficient development: procedural syntax in line with natural thinking; rich class libraries
• Flexible and open: multi-source mixed computation; runs independently or embedded into applications
• Resource saving: a single machine can match a cluster, reducing hardware expense; environment-friendly
• Sharp cost reduction: development, hardware, and O&M costs reduced by X times