Comparison between top BI & Data Discovery Solutions.
This is a short deck comparing Pyramid Analytics to Sisense, focusing on the core differences and the strengths and weaknesses of each.
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters... (HostedbyConfluent)
This document discusses streaming data between Confluent Cloud and MongoDB Atlas. It provides an overview of MongoDB Atlas and its fully managed database capabilities in the cloud. It then demonstrates how to stream data from a Python generator application to MongoDB Atlas using Confluent Cloud and its connectors. The presentation concludes by providing a reference architecture for connecting Confluent Platform to MongoDB.
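To make that flow concrete, here is a minimal sketch of the producer leg (bootstrap server, credentials, and topic name are placeholders, not values from the talk); a MongoDB Atlas sink connector configured in Confluent Cloud would then drain the topic into a collection:

```python
# Minimal sketch of the generator -> Confluent Cloud leg of the pipeline.
# Bootstrap server, API key/secret, and topic name are placeholders.
import itertools
import json
import random
import time

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
})

def generate_events():
    """Endless stream of fake sensor readings."""
    while True:
        yield {"device": random.randint(1, 10), "temp": random.uniform(20.0, 30.0)}

for event in itertools.islice(generate_events(), 100):
    producer.produce("demo.sensors", value=json.dumps(event).encode())
    producer.poll(0)   # serve delivery callbacks
    time.sleep(0.1)
producer.flush()

# A MongoDB Atlas sink connector configured in Confluent Cloud would
# consume "demo.sensors" and write each record into an Atlas collection.
```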
1. The document discusses different hosting architectures for Magento 2 websites, from a single server setup to more advanced scalable architectures on AWS.
2. A single server setup provides simple installation but has limitations around resource contention and lack of elasticity. Moving to a traditional multi-server setup provides more isolation and redundancy but lacks flexibility.
3. The most flexible solution discussed is deploying to AWS using a microservices architecture, which allows dedicated resource provisioning and high scalability, but requires more complex infrastructure monitoring and operational investment.
This document introduces cloud computing, which provides computing resources as a service via virtualization and shared resources. Key aspects include flexible provisioning without large upfront costs, cost-effectiveness through on-demand access without ownership costs, and scalability through elastic scaling of resources via API calls. Popular cloud services from Amazon are described, including compute, storage, content delivery, DNS, databases, messaging, and data processing.
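The "elastic scaling via API calls" point is easy to picture in code; a minimal boto3 sketch (AMI ID, region, and instance type are placeholders):

```python
# A minimal sketch of "elastic scaling via API calls" using boto3.
# The AMI ID and region are placeholders, not values from the deck.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Provision capacity on demand -- no upfront hardware purchase.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
print("launched", instance_id)

# Release capacity when load drops -- pay only for what was used.
ec2.terminate_instances(InstanceIds=[instance_id])
```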
Kafka Summit NYC 2017 - Simplifying Omni-Channel Retail at Scale (confluent)
This document summarizes Aaron Strey's presentation on how Target uses Apache Kafka to support omni-channel retail operations at scale. Some key points:
Target uses Kafka for log aggregation, threat detection, clickstream analysis, and business event messaging. They have over 1,800 stores and 38 distribution centers in the US, serving over 26 million online visitors per month.
Target's large Kafka deployment includes up to 300 topics per cluster, with 10-20 thousand consumer requests per second and compaction widely used. They aim for exactly once semantics across a diverse set of clients.
Strey suggests reinventing log aggregation to allow querying log streams directly from Kafka as easily as current methods using Elastic, to avoid indexing ter
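On the exactly-once goal mentioned above, a minimal sketch of Kafka's idempotent, transactional producer using confluent-kafka (broker address, transactional id, and topic are placeholders, not Target's configuration):

```python
# Minimal sketch of exactly-once-style producing with confluent-kafka.
# Broker, transactional id, and topic are placeholders, not Target's setup.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,        # broker de-duplicates retries
    "transactional.id": "demo-txn-1",  # enables atomic commits
})

producer.init_transactions()
producer.begin_transaction()
producer.produce("business-events", key="order-42", value=b'{"status": "shipped"}')
producer.commit_transaction()  # messages become visible all-or-nothing
```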
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Walters... (HostedbyConfluent)
This document discusses streaming data between Confluent Cloud and MongoDB Atlas. It provides an overview of MongoDB Atlas and its fully managed database capabilities in the cloud. It then demonstrates how to stream data from a Python generator application to MongoDB Atlas using Confluent Cloud and its connectors. The document promotes using MongoDB Atlas as a turnkey database as a service solution and shows how it can be integrated with Confluent Cloud for streaming data workflows.
Misusing MLflow To Help Deduplicate Data At Scale (Databricks)
At Intuit, we have a lot of data – and a lot of duplicate data collected over decades. So we built a rule-based, self-serve tool to identify and merge duplicate records. It takes experimentation and iteration to get deduplication just right for 100s of millions of records, and spreadsheet-based tracking just wasn’t enough. We now use MLflow to automatically capture execution notes, rule settings, weights, key validation metrics, etc., all without requiring end-user action. In this talk, we’ll talk about our use case and why MLflow is useful outside its traditional ML Ops use cases.
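A minimal sketch of the pattern being described, using MLflow's standard tracking API (the rule names, weights, and metrics are hypothetical, not Intuit's actual settings):

```python
# Minimal sketch of using MLflow to track a non-ML job: a dedup rule run.
# Rule names, weights, and metrics here are hypothetical.
import mlflow

with mlflow.start_run(run_name="dedup-rules-v7"):
    mlflow.log_params({
        "name_match_weight": 0.6,
        "address_match_weight": 0.4,
        "merge_threshold": 0.85,
    })
    # ... run the rule-based dedup job here ...
    mlflow.log_metrics({
        "candidate_pairs": 1_250_000,
        "merged_records": 98_432,
        "false_merge_rate": 0.002,
    })
    mlflow.set_tag("notes", "tightened address matching after QA review")
```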
Scaling Online ML Predictions At DoorDash (Databricks)
DoorDash is a 3-sided marketplace that consists of Merchants, Consumers, and Dashers.
As DoorDash's business grows, online ML prediction volume grows exponentially to support its various Machine Learning use cases, such as ETA prediction, Dasher assignment, personalized restaurant and menu-item recommendations, and ranking for a large volume of search queries.
The prediction service built for these use cases now supports many dozens of models spanning different Machine Learning approaches, including gradient boosting, neural networks, and rule-based models. The service serves more than 10 billion predictions every day, with a peak rate above 1 million per second.
In this session, we will share our journey of building and scaling our Machine Learning platform, and particularly the prediction service: the optimizations we experimented with, lessons learned, and the technical decisions and tradeoffs made. We will also share how we measure success and how we set goals for the future. Finally, we will end by highlighting the challenges ahead of us in extending our Machine Learning platform to support the Data Scientist community and a wider set of use cases at DoorDash.
Building scalable cloud-native applications (Sam Vanhoutte at Codit Azure Paa...) (Codit)
This document discusses learnings from building scalable cloud-native solutions. It covers considerations for scalability like decoupling services, partitioning data and throttling external communications. Specific patterns for communication between services are examined like asynchronous messaging for durability and load leveling. Testing is emphasized to ensure solutions can scale out. Designing for changes in the cloud over time with new services and features is also advised.
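As a toy illustration of the asynchronous-messaging / load-leveling pattern mentioned there, a minimal sketch in which a bounded in-process queue stands in for a real message broker:

```python
# Minimal sketch of queue-based load leveling: a bursty producer is absorbed
# by a bounded queue while a consumer drains it at a steady rate. An
# in-process queue.Queue stands in for a real message broker here.
import queue
import threading
import time

work = queue.Queue(maxsize=100)   # bounded: applies back-pressure when full

def bursty_producer():
    for i in range(20):
        work.put(f"request-{i}")  # blocks if the queue is full
    work.put(None)                # sentinel: no more work

def steady_consumer():
    while (item := work.get()) is not None:
        time.sleep(0.05)          # simulate fixed processing capacity
        print("handled", item)

threading.Thread(target=bursty_producer).start()
steady_consumer()
```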
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics (SingleStore)
Nikita Shamgunov presented on the Real-Time Chief Data Officer and the cloud-forward path to predictive analytics. He discussed how MemSQL provides a modern data architecture that enables real-time access to all data, flexible deployments across public/private clouds, and a 360 view of the business without data silos. He showcased several customer use cases that demonstrated transforming analytics from weekly to daily using MemSQL and reducing latency from days to minutes. Finally, he proposed strategies for building a hybrid cloud approach and real-time analytics infrastructure to gain faster historical insights and predictive capabilities.
Building an IoT Kafka Pipeline in Under 5 Minutes (SingleStore)
This document discusses building an IoT Kafka pipeline using MemSQL in under 5 minutes. It begins with an overview of IoT, Kafka, and operational data warehouses. It then discusses MemSQL and how it functions as an operational data warehouse by continuously loading and querying data in real-time. The document demonstrates launching a MemSQL cluster, creating schemas and pipelines to ingest, transform, persist and analyze IoT data from Kafka. It emphasizes MemSQL's ability to handle different data types and scales from IoT at high throughput with low latency.
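A minimal sketch of the kind of pipeline the demo walks through (host, topic, and schema are placeholders; the DDL follows MemSQL/SingleStore's documented CREATE PIPELINE ... LOAD DATA KAFKA syntax, driven here over the MySQL wire protocol):

```python
# Minimal sketch of a Kafka -> MemSQL (SingleStore) ingest pipeline.
# Host, credentials, topic, and schema are placeholders.
import pymysql  # MemSQL speaks the MySQL wire protocol

conn = pymysql.connect(host="memsql-host", user="root", password="", database="iot")
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sensor_readings (
            device_id BIGINT,
            ts DATETIME,
            temperature DOUBLE
        )""")
    # The pipeline continuously loads the Kafka topic into the table.
    cur.execute("""
        CREATE PIPELINE IF NOT EXISTS sensors
        AS LOAD DATA KAFKA 'kafka-host/iot-topic'
        INTO TABLE sensor_readings
        FIELDS TERMINATED BY ','""")
    cur.execute("START PIPELINE sensors")
conn.commit()
```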
This document discusses how database-as-a-service (DBaaS) using Cassandra is growing rapidly. It notes that DBaaS is forecast to be the fastest growing category over the next four years. It then provides best practices for using Cassandra on DBaaS, including starting small on non-critical applications, testing data modeling without premature tuning, and allowing the DBaaS to automatically scale across multiple regions, data centers, and providers.
How Azure turns out to be vital for Soludoc's innovation strategy (Geert Truy...) (Codit)
This document discusses how a company called DataBANG uses Microsoft Azure to provide an end-to-end solution for business process outsourcing. DataBANG uses Azure event hubs and stream analytics to track documents through each step of their process and provide customers with detailed insights and reporting. They worked with a partner to design their Azure architecture, which evolved over two versions as their technical needs and Azure's capabilities changed. The Azure solution allows DataBANG to generate events for both internal and customer applications to achieve end-to-end visibility across company and customer boundaries.
Driving the On-Demand Economy with Predictive Analytics (SingleStore)
Nikita Shamgunov, CTO and Co-founder of MemSQL, discusses how MemSQL enables real-time predictive analytics through its in-memory and scale-out database. MemSQL allows data from hundreds of thousands of machines to be analyzed and delivers value through real-time code deployment, anomaly detection, and A/B testing results. MemSQL is a scalable, elastic, and real-time data warehouse that can be deployed on-premises, as a managed cloud service, or in multi-cloud environments.
Altitude San Francisco 2018: Scale and Stability at the Edge with 1.4 Billion... (Fastly)
Braze is a customer engagement platform that delivers more than a billion messaging experiences across push, email, apps and more each day. In this session, Jon Hyman will describe the company's challenges during an inflection point in 2015 when the company reached the limitation of their physical networking equipment, and how Braze has since grown more than 7x on Fastly. Jon will also discuss how Braze uses Fastly's Layer 7 load balancing to improve stability and uptime of its APIs.
Introduction to Big Data using AWS Services (Anjani Phuyal)
This document provides an introduction to key concepts in data analytics including data, information, the types of data analytics, benefits and use cases of data analytics, challenges, and common data analytics tools. It also covers related topics like streaming data, data visualization, and big data.
OneSpot is an advertising platform that uses Amazon Redshift, a fast, petabyte-scale data warehouse service from AWS, to analyze large amounts of customer data. Redshift uses a column-oriented design and cluster architecture to optimize for read performance at large scales. It provides standard SQL functionality and can scale to petabytes of data, making it easy for OneSpot to manage over 300 billion rows of customer data without requiring a dedicated database administrator.
This document discusses building a scalable and performant data warehouse on Hadoop for business intelligence and analytics. It describes using an AWS r3.8xlarge 4-node cluster and an Oracle Exadata appliance for the data tier. The data warehouse contains 100GB of fact table data from 650 million sales transactions across 9 dimensions involving 12 million customers. Tools like Cloudera's TPC-DS toolkit and Impala are used to access and query the data warehouse. Challenges include integrating Hive and HBase with Impala and slow conversion from text to the more efficient Parquet format.
MongoDB World 2018: Data Models for Storing Sophisticated Customer Journeys i... (MongoDB)
Braze uses MongoDB to store customer data and power sophisticated customer journeys. Nearly 10 billion customer profiles are stored across many MongoDB clusters. Campaigns for messaging customers are represented as documents with embedded objects for messages, scheduling, targeting, and conversions. Canvases orchestrate multi-step journeys by linking campaign documents through embedded steps and path variations. This data model allows Braze to quickly query customer segments and send hundreds of millions of personalized messages per hour.
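A minimal sketch of what such a campaign document might look like (field names are illustrative guesses, not Braze's actual schema):

```python
# Minimal sketch of a campaign document with embedded objects, in the spirit
# of the data model described. Field names are illustrative, not Braze's schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
campaigns = client.engagement.campaigns

campaigns.insert_one({
    "name": "spring-sale-push",
    "messages": [  # embedded message variants
        {"channel": "push", "body": "Spring sale starts now!"},
        {"channel": "email", "subject": "Don't miss our spring sale"},
    ],
    "schedule": {"start": "2018-04-01T09:00:00Z", "timezone_aware": True},
    "targeting": {"segment": "active_last_30d", "countries": ["US", "CA"]},
    "conversions": {"event": "purchase", "window_hours": 72},
})
```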
Cloud Expo Europe 2014: Practical methods to improve your security in the cloud (Databarracks)
This document provides practical methods for improving security in the cloud, including:
1) Using a nuclear bunker data center that is certified and accredited for security.
2) Implementing penetration testing of cloud platforms and servers.
3) Strengthening access controls through federated authentication and two-factor authentication.
4) Using firewalls, VPNs, and encryption of data arrays, files within VMs, and entire VMs to securely transmit and store data in the cloud.
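On point 4, a minimal sketch of encrypting a file before it leaves a VM, using the cryptography package's Fernet recipe (key handling is deliberately simplified; a real deployment would keep the key in a KMS):

```python
# Minimal sketch of encrypting a file before storing it in the cloud.
# Key management is simplified here; in practice the key would live in a KMS.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store this securely, never beside the data
cipher = Fernet(key)

with open("report.csv", "rb") as f:
    ciphertext = cipher.encrypt(f.read())

with open("report.csv.enc", "wb") as f:
    f.write(ciphertext)

# Later, after download:
plaintext = cipher.decrypt(ciphertext)
print(len(plaintext), "bytes recovered")
```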
How to Develop and Deploy Web-Scale Applications on AWS (Databarracks)
Johan Holder presents our checklist for deployment of web-scale applications on AWS. We cover the fundamentals you need to apply when moving applications from a legacy hosting provider or building new services from scratch.
• How to architect your AWS environment to take advantage of specific services such as Elastic Load Balancing, CloudFront, Amazon SQS and S3.
• How to build for scalability, resilience and security
• How to manage your costs
We also show you how to avoid the common mistakes we see organisations make when getting started with AWS.
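As one concrete example of the service list above, decoupling two tiers with Amazon SQS via boto3 (queue URL and region are placeholders):

```python
# Minimal sketch of decoupling two components with Amazon SQS via boto3.
# The queue URL and region are placeholders.
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"

# Front-end tier: enqueue work and return immediately.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Worker tier: poll, process, then delete.
resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=10, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    print("processing", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```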
Rick Negrin discusses enabling real-time analytics for IoT applications. He describes how industries are increasingly needing real-time analytics due to trends like the on-demand economy and rise of IoT. He then outlines an architecture using Kafka for messaging and MemSQL for real-time analytics. MemSQL is presented as a SQL database that can ingest millions of events per second while analyzing petabytes of data. Finally, Negrin demonstrates an IoT application called MemEx that combines MemSQL, Kafka and Spark to enable predictive analytics on sensor data for supply chain management.
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift (Amazon Web Services)
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
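To ground the schema-design and loading advice, a minimal sketch of a columnar-friendly table plus a parallel COPY load (cluster endpoint, credentials, IAM role, and S3 path are placeholders):

```python
# Minimal sketch of Redshift schema design + bulk load, run over psycopg2.
# Endpoint, credentials, IAM role, and S3 path are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="admin", password="<PASSWORD>",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            sale_id     BIGINT,
            customer_id BIGINT,
            sale_date   DATE,
            amount      DECIMAL(12,2)
        )
        DISTKEY (customer_id)      -- co-locate rows joined on customer
        SORTKEY (sale_date)        -- prune blocks on date-range scans
    """)
    # COPY loads from S3 in parallel across cluster slices.
    cur.execute("""
        COPY sales FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
        FORMAT AS CSV
    """)
```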
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri... (Amazon Web Services)
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight to AWS data stores, flat files, or other third-party data sources and begin visualizing your data in minutes. We also introduce SPICE - a new Super-fast, Parallel, In-memory Calculation Engine in Amazon QuickSight - which performs advanced calculations and renders visualizations rapidly without requiring any additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides you with smart visualizations and graphs that are optimized for your different data types, ensuring the most suitable visualization for your analysis, and how to share these visualization stories using the built-in collaboration tools.
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users? (TechWell)
When you’re building the next killer mobile app, how can you ensure that your app is both stable and capable of near-instant data updates? The answer: Build a backend! Siva Katir says that there’s much more to building a backend than standing up a SQL server in your datacenter and calling it a day. Since different types of apps demand different backend services, how do you know what sort of backend you need? And, more importantly, how can you ensure that your backend scales so you can survive an explosion of users when you are featured in the app store? Siva discusses the common scenarios facing mobile app developers looking to expand beyond just the device. He’ll share best practices learned while building the PlayFab backend and those of other companies. Join Siva to learn how you can ensure that your app can scale safely and affordably into the millions of concurrent users and across multiple platforms.
AWS January 2016 Webinar Series - Getting Started with Big Data on AWS (Amazon Web Services)
With hundreds of new and sometimes disparate tools, it’s hard to keep pace. Amazon Web Services provides a broad and fully integrated portfolio of cloud computing services to help you build, secure and deploy your big data applications.
Attend this webinar to get an overview of the different big data options available in the AWS Cloud – including popular big data frameworks such as Hadoop, Spark, NoSQL databases, and more. Learn about ideal use cases, cases to avoid, performance, interfaces, and more. Finally, learn how you can build valuable applications with a real-life example.
Learning Objectives:
Learn about big data tools available at AWS
Understand ideal use cases
Learn some of the key considerations such as performance, scalability, elasticity and availability, when selecting big data tools
Who Should Attend:
Data Architects, Data Scientists, Developers
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...) (Michael Rys)
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
Full 360 is a cloud consulting firm that provides big data, API/UX, and cloud operations services. They helped a customer migrate their data from Netezza to Redshift, building a structured data lake and optimizing queries for equivalent or better performance. Lessons from the project included data standardization, tuning techniques like encoding and sort keys, and creating reusable ingestion processes. The migration reduced license costs and improved operational flexibility.
AWS Summit 2013 | India - Petabyte Scale Data Warehousing at Low Cost, Abhish... (Amazon Web Services)
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. In this session we'll give an introduction to the service and its pricing before diving into how it delivers fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more.
IBM's Big Data platform provides tools for managing and analyzing large volumes of structured, unstructured, and streaming data. It includes Hadoop for storage and processing, InfoSphere Streams for real-time streaming analytics, InfoSphere BigInsights for analytics on data at rest, and PureData System for Analytics (formerly Netezza) for high performance data warehousing. The platform enables businesses to gain insights from all available data to capitalize on information resources and make data-driven decisions.
This document discusses designing a modern data warehouse in Azure. It provides an overview of traditional vs. self-service data warehouses and their limitations. It also outlines challenges with current data warehouses around timeliness, flexibility, quality and findability. The document then discusses why organizations need a modern data warehouse based on criteria like customer experience, quality assurance and operational efficiency. It covers various approaches to ingesting, storing, preparing, modeling and serving data on Azure. Finally, it discusses architectures like the lambda architecture and common data models.
IBM's Big Data platform provides tools for managing and analyzing large volumes of data from various sources. It allows users to cost effectively store and process structured, unstructured, and streaming data. The platform includes products like Hadoop for storage, MapReduce for processing large datasets, and InfoSphere Streams for analyzing real-time streaming data. Business users can start with critical needs and expand their use of big data over time by leveraging different products within the IBM Big Data platform.
Architecting Snowflake for High Concurrency and High Performance (SamanthaBerlant)
Cloud Data Warehousing juggernaut Snowflake has raced out ahead of the pack to deliver a data management platform from which a wealth of new analytics can be run. Using Snowflake as a traditional data warehouse has some obvious cost advantages over a hardware solution. But the real value of Snowflake as a data platform lies in its ability to support a high-concurrency analytics platform using Kyligence Cloud, powered by Apache Kylin.
In this presentation, Senior Solutions Architect Robert Hardaway will describe a modern data service architecture using precomputation and distributed indexes to provide interactive analytics to hundreds or even thousands of users running against very large Snowflake datasets (TBs to PBs).
An overview of modern scalable web development (Tung Nguyen)
The document provides an overview of modern scalable web development trends. It discusses the motivation to build systems that can handle large amounts of data quickly and reliably. It then summarizes the evolution of software architectures from monolithic to microservices. Specific techniques covered include reactive system design, big data analytics using Hadoop and MapReduce, machine learning workflows, and cloud computing services. The document concludes with an overview of the technologies used in the Septeni tech stack.
How Analytics Teams Using SSAS Can Embrace Big Data and the Cloud (Tyler Wishnoff)
You’ve been using SQL Server Analysis Services (SSAS) and you love it.
Its multidimensional analysis enables your team to slice and dice your data any way they want and get the results back easily.
The only problem is:
• It doesn’t work very well with the latest technologies
• It wasn’t built to handle Big Data or to serve large analytics teams
• And, most frustrating of all, it still isn’t available in the Cloud
The good news is that there’s a way to unburden yourself from the limitations of SSAS, without losing the capabilities you rely on.
If you’re ready to modernize the way your team does analytics, this presentation will provide you the tools and ideas you need to do so.
This presentation will show you how to:
• Efficiently migrate your SSAS-based workload to the Cloud
• Super-charge your SSAS applications, seamlessly, easily and cost-effectively
• Provide unlimited scale of data, concurrency and deliver sub-second response
• Make your SQL/MDX queries scale and perform beyond your imagination
• Ensure Enterprise-grade Security
• Plus, get actionable examples and stories, from organizations who have successfully overcome these challenges
For more information, visit www.Kyligence.io
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Transform your DBMS to drive engagement innovation with Big Data (Ashnikbiz)
This document discusses how organizations can save money on database management systems (DBMS) by moving from expensive commercial DBMS to more affordable open-source options like PostgreSQL. It notes that PostgreSQL has matured and can now handle mission critical workloads. The document recommends partnering with EnterpriseDB to take advantage of their commercial support and features for PostgreSQL. It highlights how customers have seen cost savings of 35-80% by switching to PostgreSQL and been able to reallocate funds to new business initiatives.
Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight - May... (Amazon Web Services)
- Learn how to use Amazon Athena to query various data formats in Amazon S3
- Learn how to use Amazon QuickSight to visualize the results of your Athena query with and without using SPICE
Querying and analyzing big data can be complicated and expensive. It requires you to set up and manage databases, data warehouses, and business intelligence applications, all of which require time, effort, and resources. Using Amazon Athena and Amazon QuickSight, you can avoid the cost and complexity by creating a fast, scalable and serverless cloud analytics solution without the need to invest in databases, data warehouses, complex ETL solutions, and BI applications. In this tech talk, we will demonstrate how you can build a serverless big data analytics solution using Amazon Athena and Amazon QuickSight.
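A minimal sketch of the Athena half of that flow (database, table, and output bucket are placeholders):

```python
# Minimal sketch of running a serverless Athena query over data in S3.
# Database, table, and output location are placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM weblogs GROUP BY status",
    QueryExecutionContext={"Database": "demo_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes -- no servers to manage on our side.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```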
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ... (MongoDB)
Corva's analytics platform enables real-time engineering and machine learning predictions and powers faster and safer drilling. The platform uses serverless AWS Lambda functions and an extensible, data-driven API backed by MongoDB to handle 100,000+ requests per minute of streaming sensor data.
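A minimal sketch of the ingestion edge of such a platform (the event shape assumes an SQS-style trigger, and the collection and environment-variable names are invented):

```python
# Minimal sketch of an AWS Lambda handler that ingests streaming sensor
# readings into MongoDB. Event shape and env var name are assumptions.
import json
import os

from pymongo import MongoClient

# Created once per container, reused across warm invocations.
client = MongoClient(os.environ["MONGODB_URI"])
readings = client.iot.sensor_readings

def handler(event, context):
    docs = [json.loads(record["body"]) for record in event["Records"]]
    readings.insert_many(docs)
    return {"ingested": len(docs)}
```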
Sisense is a business intelligence software company that provides analytics tools for businesses. Its product allows non-technical users to connect to and analyze large datasets from multiple sources. Key features include the ability to drag and drop data from various sources, build visualizations like dashboards without coding, and its in-chip technology that can analyze data much faster than other solutions. Sisense has over 700 customers worldwide across many industries. It is recognized for its innovative technology and high-performance capabilities.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
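For a taste of the Milvus side of such a pipeline, a minimal sketch with pymilvus and Milvus Lite (collection name and vector dimension are arbitrary; in a NiFi flow these calls would live in a processor rather than a script):

```python
# Minimal sketch of inserting and searching vectors in Milvus via pymilvus.
# Collection name and dimension are arbitrary; embeddings are faked as
# random vectors where a real pipeline would call an embedding model.
import random

from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")  # Milvus Lite: local file-backed
client.create_collection(collection_name="docs", dimension=8)

client.insert(
    collection_name="docs",
    data=[
        {"id": i, "vector": [random.random() for _ in range(8)], "text": f"chunk {i}"}
        for i in range(100)
    ],
)

hits = client.search(
    collection_name="docs",
    data=[[random.random() for _ in range(8)]],  # query embedding
    limit=3,
    output_fields=["text"],
)
print(hits)
```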
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
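To illustrate the auto-generated-view idea, a minimal sketch that derives a compliance-enforcing view from declarative column annotations (the annotation format, column names, and masking rules are invented, not LinkedIn's):

```python
# Minimal sketch of generating a compliance-enforcing SQL view from
# declarative column annotations, in the spirit of ViewShift.
# The annotation format and masking rules are invented for illustration.
annotations = {
    "member_id": "pass",        # no restriction
    "email": "mask",            # hide the value
    "birth_date": "generalize", # reduce precision
}

def masked_expr(column: str, policy: str) -> str:
    if policy == "mask":
        return f"CAST(NULL AS VARCHAR) AS {column}"
    if policy == "generalize":
        return f"date_trunc('year', {column}) AS {column}"
    return column

select_list = ",\n  ".join(masked_expr(c, p) for c, p in annotations.items())
view_ddl = f"CREATE VIEW members_compliant AS\nSELECT\n  {select_list}\nFROM members"
print(view_ddl)
# A catalog layer would then transparently resolve "members" to
# "members_compliant" for queries that lack elevated access.
```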
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
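A minimal sketch of the end-to-end testing idea, shrunk to a toy two-step pipeline run in-process on fixture data (a real setup would drive the actual workflow orchestrator instead):

```python
# Minimal sketch of an end-to-end pipeline test: run every step in order
# on tiny fixture data and assert on the final output, so an upstream
# change that breaks a downstream step fails the test immediately.
# The steps here are toys; real tests would invoke the orchestrator.
def extract():
    return [{"user": "a", "amount": 10}, {"user": "b", "amount": -3}]

def clean(rows):
    return [r for r in rows if r["amount"] > 0]

def aggregate(rows):
    return {"total": sum(r["amount"] for r in rows)}

def test_pipeline_end_to_end():
    result = aggregate(clean(extract()))
    assert result == {"total": 10}

test_pipeline_end_to_end()
print("pipeline test passed")
```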
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
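For a flavor of those time semantics, a minimal sketch that queries QuestDB over its PostgreSQL wire protocol and uses the SAMPLE BY extension for time-bucketed aggregation (table and column names are invented):

```python
# Minimal sketch of querying QuestDB over its PostgreSQL wire protocol
# (port 8812 by default). Table and column names are invented.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)
with conn.cursor() as cur:
    # SAMPLE BY buckets rows by time -- timestamps as first-class citizens.
    cur.execute("""
        SELECT ts, avg(temperature)
        FROM sensor_readings
        WHERE ts > dateadd('d', -1, now())
        SAMPLE BY 15m
    """)
    for ts, avg_temp in cur.fetchall():
        print(ts, avg_temp)
```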
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
2. Data Engine & Scalability
Sisense
• ElastiCube: single, proprietary choice
• 500GB limit
• Single node only per cube; limited scale out options
• Ingestion only. No true direct querying options
Pyramid
• Multiple, mix-and-match, ‘open’ engines
• Unlimited Sizing
• Multiple scale up or out strategies
• Ingestion or true direct querying
[Diagram] Sisense data engine: ElastiCube only. Pyramid data engine options:
• In-Memory: Pyramid, SAP HANA
• MS OLAP/Tabular
• MPP: Teradata, Exadata, Vertica, Netezza
• Relational: Oracle, SQL Server, DB2, Sybase, Postgres
• Cloud: Redshift, Snowflake, BigQuery, SQL Azure
• Big Data / Unstructured: Apache Presto / Drill, Mongo
Note: Data Engine ≠ Data Source; data sources are required to populate data engines.
3. Visualizations & Analytics
Sisense
• Simple things are easier
• Time to basics = VERY FAST
• Advanced Analytics ONLY through code
• Requires a developer
• Time to real world analytics = VERY LONG
Pyramid
• Simple things are easy
• Time to basics = FAST
• Advanced Analytics requires NO CODE
• Point-and-Click
• Time to real world analytics = VERY SHORT
4. Publishing, Data Prep & Extensibility
Sisense
• Solid REST API Framework
• Embedding via IFRAMES: old & unscalable
• Basic data preparation tools without ML
• No Publishing or Printing capabilities
Pyramid
• Extensive REST API Framework
• Embedding via DIV tags: modern HTML5 & scales
• Comprehensive data preparation tools with ML
• End-user report publisher with pixel perfect printing
5. Governance & Management
Sisense
• Basic content management
• No data lineage, no content versioning
• No multi-tenancy or rights control; limited security options
• No metadata layering; data security for ElastiCube only
Pyramid
• Powerful content management
• Data lineage, versioning
• Multi-tenant, deep rights control and security options
• Metadata layering with complete data security on all sources
6. Price
Sisense
• Cost driven by users and data
• Cost increases based on data footprint size
• Pay MORE to use more
• Deployments start at $20k/year minimum
Pyramid
• Cost driven by user only
• No cost change for data footprint
• Pay the SAME to use more
• The first 3 users are free.