1) In-memory computing is growing rapidly, with the total data market expected to grow from $69 billion in 2015 to $132 billion in 2020.
2) In-memory databases are gaining popularity for applications that require fast response times, like telecommunications and mobile advertising, as memory access is faster than disk access.
3) Modern applications are driving adoption of in-memory solutions as they generate more data from more users and transactions and require faster performance to handle growing traffic.
4) Two examples presented were DellEMC using MemSQL for a real-time customer 360 application and an IoT logistics application called MemEx that processes sensor data from warehouses for predictive analytics.
Real-Time, Geospatial, Maps by Neil Dahlke (SingleStore)
This document discusses two real-time geospatial analytics demos built on MemSQL: PowerStream and Supercar. PowerStream predicts the health of 197,000 wind turbines worldwide using 1 million sensor data points per second. Supercar tracks NYC taxi and limousine data in real time to analyze the on-demand economy. Both demos extract, transform, and load streaming data into MemSQL for real-time querying and visualization.
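The kind of radius filter a geospatial dashboard like these runs can be sketched in plain Python. The haversine formula and the `vehicles_within` helper below are illustrative stand-ins, not code from either demo:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def vehicles_within(center, radius_km, positions):
    # Keep only the vehicles whose last reported position lies inside the radius.
    lat, lon = center
    return [vid for vid, (vlat, vlon) in positions.items()
            if haversine_km(lat, lon, vlat, vlon) <= radius_km]
```

In a database like MemSQL this filter would instead be pushed into SQL against a geospatial index; the Python version just shows the computation a map view performs per refresh.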
Real-Time Geospatial Intelligence at Scale (SingleStore)
This document introduces MemSQL 5, a real-time database platform for transactions and analytics. It discusses how MemSQL is designed for modern workloads by providing scalable SQL on in-memory and solid-state storage across distributed data centers or the cloud. MemSQL allows for real-time processing through features like stream processing and real-time dashboards. Examples are given of using MemSQL for Internet of Things applications to monitor wind turbines and taxi ride data.
CTO View: Driving the On-Demand Economy with Predictive Analytics (SingleStore)
In the on-demand economy, real-time analytics is both a necessity and a competitive advantage. The next evolution in the on-demand economy is predictive analytics fueled by live streams of data—in effect, knowing what customers want before they do. This session will feature technical examples of real-time pipelines, machine learning, and custom dashboards, as well as off-the-shelf dashboards with Tableau.
Driving the On-Demand Economy with Spark and Predictive Analytics (SingleStore)
The document discusses how data scientists need real-time analytics capabilities to power the on-demand economy. It introduces MemSQL 5 as a database platform for real-time analytics that can help overcome barriers like slow loading, queries, and ongoing data processing faced with batch processing. MemSQL 5 includes features like Streamliner for building real-time data pipelines and predictive analytics using Spark and MLlib to power applications like predictive scoring and IoT.
Real-Time Analytics with Confluent and MemSQL (SingleStore)
This document discusses enabling real-time analytics for IoT applications. It describes how industries like auto, transportation, energy, warehousing and logistics, and healthcare need real-time analytics to handle streaming data from IoT sensors. It also discusses how Confluent's Kafka stream processing platform can be used to build applications that ingest IoT data at high speeds, transform the data, and power real-time analytics and user interfaces. MemSQL's in-memory database is presented as a fast and scalable storage option to support real-time analytics on the large volumes of IoT data.
Hadoop can enable zero downtime app deployments by using microservices, continuous delivery, and real-time analytics. The presenters describe how Expedia saves $5M annually through zero downtime deployments. Their architecture uses microservices, continuous integration, deployment monitoring with Storm/Kafka/HDFS, and analytics in Solr/Hive to enable canary testing, fast feedback, and automated problem resolution. A live demo shows log processing, analytics, and using results to ensure smooth, high-quality deployments.
Modeling the Smart and Connected City of the Future with Kafka and Spark (SingleStore)
- Modeling the Smart and Connected City of the Future with Kafka and Spark discusses using Kafka, Spark, and MemSQL to build a real-time data pipeline for a hypothetical "MemCity" that captures data from 1.4 million households.
- The document outlines the components of the "Real-Time Trinity" - Kafka for a high-throughput message queue, Spark for data transformation, and MemSQL for real-time data serving and analytics.
- It also introduces MemSQL Streamliner, which is designed to simplify the creation of real-time data pipelines through a graphical interface and one-click deployment of integrated Apache Spark clusters.
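The transform stage of such a "Real-Time Trinity" pipeline can be sketched as a small function that turns one Kafka message into a row ready for a SQL insert. The JSON shape here (`household_id`, `sensor`, `reading`, `ts`) is hypothetical, not a schema from the talk:

```python
import json
from datetime import datetime, timezone

def transform(message: bytes):
    # Spark-style transform step: parse one Kafka message (JSON bytes)
    # and flatten it into a typed row ready for a SQL INSERT.
    event = json.loads(message)
    return (
        event["household_id"],
        event["sensor"],
        float(event["reading"]),  # readings may arrive as strings; coerce once here
        datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
    )

msg = b'{"household_id": 42, "sensor": "power_kw", "reading": "1.37", "ts": 0}'
row = transform(msg)
```

In the full pipeline this function would run inside a Spark job consuming from Kafka, with the resulting rows batched into MemSQL for serving.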
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics (SingleStore)
Nikita Shamgunov presented on the Real-Time Chief Data Officer and the cloud-forward path to predictive analytics. He discussed how MemSQL provides a modern data architecture that enables real-time access to all data, flexible deployments across public/private clouds, and a 360 view of the business without data silos. He showcased several customer use cases that demonstrated transforming analytics from weekly to daily using MemSQL and reducing latency from days to minutes. Finally, he proposed strategies for building a hybrid cloud approach and real-time analytics infrastructure to gain faster historical insights and predictive capabilities.
Building an IoT Kafka Pipeline in Under 5 Minutes (SingleStore)
This document discusses building an IoT Kafka pipeline using MemSQL in under 5 minutes. It begins with an overview of IoT, Kafka, and operational data warehouses. It then discusses MemSQL and how it functions as an operational data warehouse by continuously loading and querying data in real-time. The document demonstrates launching a MemSQL cluster, creating schemas and pipelines to ingest, transform, persist and analyze IoT data from Kafka. It emphasizes MemSQL's ability to handle different data types and scales from IoT at high throughput with low latency.
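The continuous-load step described above is a single DDL statement. The helper below builds one, with hypothetical broker, topic, and table names; the `CREATE PIPELINE ... LOAD DATA KAFKA` form follows MemSQL/SingleStore pipeline syntax, but the exact options vary by version, so treat this as a sketch:

```python
def kafka_pipeline_ddl(name, broker, topic, table):
    # Build the MemSQL/SingleStore CREATE PIPELINE statement that
    # continuously loads a Kafka topic into a table.
    return (
        f"CREATE PIPELINE {name} AS "
        f"LOAD DATA KAFKA '{broker}/{topic}' "
        f"INTO TABLE {table}"
    )

ddl = kafka_pipeline_ddl("iot_readings", "kafka.example.com:9092", "sensors", "readings")
# In a real deployment you would execute `ddl` through any MySQL-protocol
# client and then run `START PIPELINE iot_readings` to begin ingestion.
```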
O'Reilly Media Webcast: Building Real-Time Data Pipelines (SingleStore)
As our customers tap into new sources of data or modify existing data pipelines, we are often asked questions like: What technologies should we consider? Where can we reduce data latency? How can we simplify our data architecture?
To eliminate the guesswork, we teamed up with Ben Lorica, Chief Data Scientist at O’Reilly Media to host a webcast centered around building real-time data pipelines.
Best Practices for Supercharging Cloud Analytics on Amazon Redshift (SnapLogic)
In this webinar, we discuss how the secret sauce of your business analytics strategy remains rooted in your approach, your methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, plus tips and tricks on designing the right information architecture, data models, and other tactical optimizations.
To learn more, visit: http://www.snaplogic.com/redshift-trial
Five ways database modernization simplifies your data life (SingleStore)
This document provides an overview of how database modernization with MemSQL can simplify a company's data life. It discusses five common customer scenarios where database limitations are impacting data-driven initiatives: 1) slow event-to-insight delays, 2) high concurrency causing "wait in line" analytics, 3) costly performance requiring specialized hardware, 4) slow queries limiting big data analytics, and 5) deployment inflexibility restricting multi-cloud usage. For each scenario, it provides an example customer situation and solution using MemSQL, highlighting benefits like real-time insights, scalable user access, cost efficiency, accelerated big data analytics, and deployment flexibility. The document also introduces MemSQL capabilities for fast data ingestion, instant
Internet of Things and Multi-model Data Infrastructure (SingleStore)
The document discusses 451 Research, an information technology research and advisory company. It provides details on 451 Research such as its founding year, number of employees, clients, reports published, and locations. It also briefly discusses 451 Research's research areas including data, advisory services, events, and its relationship to The 451 Group.
MemSQL is an in-memory distributed database that provides fast data processing for real-time analytics. It allows companies to extract greater insights from big data in real time. MemSQL is used by companies for applications like ad targeting, recommendations, fraud detection, and more. It provides rapid data loading and querying, horizontal scalability, and supports both relational and JSON data. Case studies describe how companies like Comcast, Zynga, CPXi, and others use MemSQL to power applications that require real-time insights from massive datasets.
Machines and the Magic of Fast Learning (SingleStore)
Human-machine interaction is no longer the exclusive province of science fiction. The advance of the internet and connected devices has inspired data scientists to create machine-learning applications to extract value from these new forms of data.
So what's the next frontier?
Join MemSQL Engineer Michael Andrews and Sr. Director Mike Boyarski to learn how to use real-time data as a vehicle for operationalizing machine-learning models. Michael and Mike will explore advanced tools, including TensorFlow, Apache Spark, and Apache Kafka, and compelling use cases demonstrating the power of machine learning to effect positive change.
You will learn:
Top technologies for building the ideal machine-learning stack
How to power machine-learning applications with real-time data
A use case and demo of machine learning for social good
Building the Next-gen Digital Meter Platform for Fluvius (Databricks)
Fluvius is the network operator for electricity and gas in Flanders, Belgium. Their goal is to modernize the way people look at energy consumption using a digital meter that captures consumption and injection data from any electrical installation in Flanders, ranging from households to large companies. After full roll-out there will be roughly 7 million digital meters active in Flanders, collecting up to terabytes of data per day. Combined with regulation requiring Fluvius to maintain a record of these readings for at least 3 years, this means petabyte scale. delaware BeLux was assigned by Fluvius to set up a modern data platform and did so on Azure, using Databricks as the core component to collect, store, process, and serve these volumes of data to every single consumer in Flanders and beyond. This enables the Belgian energy market to innovate and move forward. Maarten took up the role of project manager and solution architect.
Building Real-Time Data Pipelines with Kafka, Spark, and MemSQL (SingleStore)
1) The document discusses building real-time data pipelines with Apache Spark and MemSQL to enable real-time analytics.
2) It describes combining the power of Spark for real-time transformations with MemSQL, a real-time database, to make Spark results more accessible.
3) The presentation includes a demo of PowerStream, a MemSQL application that predicts the health of wind turbines using streaming data.
"Building Real-Time Data Pipelines with Kafka and MemSQL" by Rick Negrin, Director of Product Management at MemSQL for Orange County Roadshow March 17, 2017.
The Fast Path to Building Operational Applications with Spark (SingleStore)
Nikita Shamgunov gave a presentation about using MemSQL and Spark together. MemSQL is a scalable operational database that can handle petabytes of data with high concurrency. It offers real-time capabilities and compatibility with tools like Spark, Kafka, and ETL/BI tools. The MemSQL Spark Connector allows bidirectional transfer of data between Spark and MemSQL tables for use cases like operationalizing models in Spark, stream/event processing, and live dashboards. Case studies showed customers gaining 10x faster data refresh times and performing entity resolution at scale for fraud detection.
Webinar: BI in the Sky - The New Rules of Cloud Analytics (SnapLogic)
In this webinar, we talk about the shift in data gravity as more and more business applications are moving to the cloud, and how the ability to deliver analytics in the cloud has evolved from idea to enterprise reality with new solutions being announced constantly that appeal to the need for speed, simplicity and access to insight on demand. Joining us in this webinar is David Glueck, Sr. Director of Data Science and Engineering at Bonobos.
To learn more, visit: www.SnapLogic.com/salesforce-analytics
The evolution of the big data platform @ Netflix (OSCON 2015) (Eva Tse)
The document summarizes the evolution of Netflix's big data platform to meet the challenges of their growing scale. Key points include:
- Netflix now has over 65 million members in over 50 countries and supports over 1000 devices. They stream over 10 billion hours of content per quarter.
- Their traditional business intelligence stack could no longer meet the demands of scale. They transitioned to using AWS services like S3 for storage and open source tools like Kafka, Cassandra, and Parquet to enable real-time analytics and machine learning on their massive data volumes.
- Netflix has adopted an open source-first strategy and contributes back to the community as their own tools evolve to meet processing needs and achieve the necessary scale to
Spark Summit East Keynote by Anjul Bhambhri (Jen Aman)
Apache Spark is a framework for large-scale data processing. IBM fully supports Spark and is building it into many of its products and services. Spark can handle both batch and streaming analytics efficiently using techniques like the Lambda architecture. IBM discusses several use cases for Spark including weather data analytics, healthcare data lakes, and customer experience analysis in telecom.
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah (Databricks)
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how The Honest Company has used Spark as a workhorse for 1) collecting, transforming (ETL), and storing data from various sources including MySQL, Mongo, JDE, Google Analytics, Facebook, Localytics, and REST APIs; 2) building data models, and aggregating and generating reports on revenue, order-fulfillment tracking, data-pipeline monitoring, and subscriptions; 3) using ML to build models for user-acquisition, LTV, and recommendation use cases. Spark replaced the monolithic codebase with flexible, scalable, and robust pipelines. Databricks helped The Honest Company focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations that improved their experience, data users at Honest came to understand users much better, segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and Spark (SingleStore)
This document discusses real-time supply chain analytics using machine learning, Kafka, and Spark. It outlines four key requirements for real-time supply chain databases: supporting massive data ingestion, serving as a system of record while providing real-time analytics, integrating with familiar ecosystems, and allowing for online scaling. The document then introduces MemSQL as a database platform that can meet these requirements using an in-memory approach. It provides an example called MemEx that combines MemSQL, Kafka, and Spark with machine learning for global supply chain management and real-time predictive analytics.
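As a toy stand-in for the kind of model such a pipeline might score, a rolling z-score detector flags sensor readings that deviate sharply from recent history. This is an illustrative sketch, not MemEx's actual model:

```python
from collections import deque
import statistics

class AnomalyDetector:
    # Rolling z-score over the last `window` readings: a cheap stand-in
    # for the predictive model a supply-chain pipeline scores in real time.
    def __init__(self, window=50, threshold=3.0):
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        history = list(self.readings)
        self.readings.append(value)
        if len(history) < 2:
            return False  # not enough data to judge yet
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return value != mean
        return abs(value - mean) / stdev > self.threshold
```

In production this scoring would run per message as data streams through Kafka and Spark, with flagged events written to the database for dashboards and alerts.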
Tapjoy: Building a Real-Time Data Science Service for Mobile Advertising (SingleStore)
Robin Li, Director of Data Engineering, and Yohan Chin, VP of Data Science at Tapjoy, share how to architect the best application experience for mobile users using technologies including Apache Kafka, Apache Spark, and MemSQL.
Speaker: Robin Li - Director of Data Engineering, Tapjoy and Yohan Chin - VP Data Science, Tapjoy
Modeling the Smart and Connected City of the Future with Kafka and SparkSingleStore
- Modeling the Smart and Connected City of the Future with Kafka and Spark discusses using Kafka, Spark, and MemSQL to build a real-time data pipeline for a hypothetical "MemCity" that captures data from 1.4 million households.
- The document outlines the components of the "Real-Time Trinity" - Kafka for a high-throughput message queue, Spark for data transformation, and MemSQL for real-time data serving and analytics.
- It also introduces MemSQL Streamliner, which is designed to simplify the creation of real-time data pipelines through a graphical interface and one-click deployment of integrated Apache Spark clusters.
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsSingleStore
Nikita Shamgunov presented on the Real-Time Chief Data Officer and the cloud-forward path to predictive analytics. He discussed how MemSQL provides a modern data architecture that enables real-time access to all data, flexible deployments across public/private clouds, and a 360 view of the business without data silos. He showcased several customer use cases that demonstrated transforming analytics from weekly to daily using MemSQL and reducing latency from days to minutes. Finally, he proposed strategies for building a hybrid cloud approach and real-time analytics infrastructure to gain faster historical insights and predictive capabilities.
Building an IoT Kafka Pipeline in Under 5 MinutesSingleStore
This document discusses building an IoT Kafka pipeline using MemSQL in under 5 minutes. It begins with an overview of IoT, Kafka, and operational data warehouses. It then discusses MemSQL and how it functions as an operational data warehouse by continuously loading and querying data in real-time. The document demonstrates launching a MemSQL cluster, creating schemas and pipelines to ingest, transform, persist and analyze IoT data from Kafka. It emphasizes MemSQL's ability to handle different data types and scales from IoT at high throughput with low latency.
O'Reilly Media Webcast: Building Real-Time Data PipelinesSingleStore
As our customers tap into new sources of data or modify to existing data pipelines, we are often asked questions like: What technologies should we consider? Where can we reduce data latency? How can we simplify our data architecture?
To eliminate the guesswork, we teamed up with Ben Lorica, Chief Data Scientist at O’Reilly Media to host a webcast centered around building real-time data pipelines.
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
In this webinar, we discuss how the secret sauce to your business analytics strategy remains rooted on your approached, methodologies and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, and tips and tricks on designing the right information architecture, data models and other tactical optimizations.
To learn more, visit: http://www.snaplogic.com/redshift-trial
Five ways database modernization simplifies your data lifeSingleStore
This document provides an overview of how database modernization with MemSQL can simplify a company's data life. It discusses five common customer scenarios where database limitations are impacting data-driven initiatives: 1) Slow event to insight delays, 2) High concurrency causing "wait in line" analytics, 3) Costly performance requiring specialized hardware, 4) Slow queries limiting big data analytics, and 5) Deployment inflexibility restricting multi-cloud usage. For each scenario, it provides an example customer situation and solution using MemSQL, highlighting benefits like real-time insights, scalable user access, cost efficiency, accelerated big data analytics, and deployment flexibility. The document also introduces MemSQL capabilities for fast data ingestion, instant
Internet of Things and Multi-model Data InfrastructureSingleStore
The document discusses 451 Research, an information technology research and advisory company. It provides details on 451 Research such as its founding year, number of employees, clients, reports published, and locations. It also briefly discusses 451 Research's research areas including data, advisory services, events, and its relationship to The 451 Group.
MemSQL is an in-memory distributed database that provides fast data processing for real-time analytics. It allows companies to extract greater insights from big data in real time. MemSQL is used by companies for applications like ad targeting, recommendations, fraud detection, and more. It provides rapid data loading and querying, horizontal scalability, and supports both relational and JSON data. Case studies describe how companies like Comcast, Zynga, CPXi, and others use MemSQL to power applications that require real-time insights from massive datasets.
Machines and the Magic of Fast LearningSingleStore
Human-machine interaction is no longer the exclusive province of science fiction. The advance of the internet and connected devices has inspired data scientists to create machine-learning applications to extract value from these new forms of data.
So what's the next frontier?
Join MemSQL Engineer Michael Andrews and Sr. Director Mike Boyarski to learn how to use real-time data as a vehicle for operationalizing machine-learning models. Michael and Mike will explore advanced tools, including TensorFlow, Apache Spark, and Apache Kafka, and compelling use cases demonstrating the power of machine learning to effect positive change.
You will learn:
Top technologies for building the ideal machine-learning stack
How to power machine-learning applications with real-time data
A use case and demo of machine learning for social good
Building the Next-gen Digital Meter Platform for FluviusDatabricks
Fluvius is the network operator for electricity and gas in Flanders, Belgium. Their goal is to modernize the way people look at energy consumption using a digital meter that captures consumption and injection data from any electrical installation in Flanders ranging from households to large companies. After full roll-out there will be roughly 7 million digital meters active in Flanders collecting up to terabytes of data per day. Combine this with regulation that Fluvius has to maintain a record of these reading for at least 3 years, we are talking petabyte scale. delaware BeLux was assigned by Fluvius to setup a modern data platform and did so on Azure using Databricks as the core component to collect, store, process and serve these volumes of data to every single consumer and beyond in Flanders. This enables the Belgian energy market to innovate and move forward. Maarten took up the role as project manager and solution architect.
Building Real-Time Data Pipelines with Kafka, Spark, and MemSQLSingleStore
1) The document discusses building real-time data pipelines with Apache Spark and MemSQL to enable real-time analytics.
2) It describes combining the power of Spark for real-time transformations with MemSQL, a real-time database, to make Spark results more accessible.
3) The presentation includes a demo of PowerStream, a MemSQL application that predicts the health of wind turbines using streaming data.
"Building Real-Time Data Pipelines with Kafka and MemSQL" by Rick Negrin, Director of Product Management at MemSQL for Orange County Roadshow March 17, 2017.
The Fast Path to Building Operational Applications with SparkSingleStore
Nikita Shamgunov gave a presentation about using MemSQL and Spark together. MemSQL is a scalable operational database that can handle petabytes of data with high concurrency. It offers real-time capabilities and compatibility with tools like Spark, Kafka, and ETL/BI tools. The MemSQL Spark Connector allows bidirectional transfer of data between Spark and MemSQL tables for use cases like operationalizing models in Spark, stream/event processing, and live dashboards. Case studies showed customers gaining 10x faster data refresh times and performing entity resolution at scale for fraud detection.
Webinar: BI in the Sky - The New Rules of Cloud AnalyticsSnapLogic
In this webinar, we talk about the shift in data gravity as more and more business applications are moving to the cloud, and how the ability to deliver analytics in the cloud has evolved from idea to enterprise reality with new solutions being announced constantly that appeal to the need for speed, simplicity and access to insight on demand. Joining us in this webinar is David Glueck, Sr. Director of Data Science and Engineering at Bonobos.
To learn more, visit: www.SnapLogic.com/salesforce-analytics
The evolution of the big data platform @ Netflix (OSCON 2015)Eva Tse
The document summarizes the evolution of Netflix's big data platform to meet the challenges of their growing scale. Key points include:
- Netflix now has over 65 million members in over 50 countries and supports over 1000 devices. They stream over 10 billion hours of content per quarter.
- Their traditional business intelligence stack could no longer meet the demands of scale. They transitioned to using AWS services like S3 for storage and open source tools like Kafka, Cassandra, and Parquet to enable real-time analytics and machine learning on their massive data volumes.
- Netflix has adopted an open source-first strategy and contributes back to the community as their own tools evolve to meet processing needs and achieve the necessary scale to
Spark Summit East Keynote by Anjul BhambhriJen Aman
Apache Spark is a framework for large-scale data processing. IBM fully supports Spark and is building it into many of its products and services. Spark can handle both batch and streaming analytics efficiently using techniques like the Lambda architecture. IBM discusses several use cases for Spark including weather data analytics, healthcare data lakes, and customer experience analysis in telecom.
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how Honest Company has used Spark as a workhorse for 1) collecting, ETL and storing data from various sources including mysql, mongo, jde, Google analytics, Facebook, Localytics and REST API; 2) building data models and aggregating and generating reports of revenue, order fulfillment tracking, data pipeline monitoring and subscriptions; 3) Using ML to build model for user acquisitions, LTV and recommendations use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company to focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations to improve experience, data users at Honest understood users much better in terms of segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.
Real-Time Supply Chain Analytics with Machine Learning, Kafka, and SparkSingleStore
This document discusses real-time supply chain analytics using machine learning, Kafka, and Spark. It outlines four key requirements for real-time supply chain databases: supporting massive data ingestion, serving as a system of record while providing real-time analytics, integrating with familiar ecosystems, and allowing for online scaling. The document then introduces MemSQL as a database platform that can meet these requirements using an in-memory approach. It provides an example called MemEx that combines MemSQL, Kafka, and Spark with machine learning for global supply chain management and real-time predictive analytics.
Tapjoy: Building a Real-Time Data Science Service for Mobile Advertising (SingleStore)
Robin Li, Director of Data Engineering and Yohan Chin, VP Data Science at Tapjoy share how to architect the best application experience for mobile users using technologies including Apache Kafka, Apache Spark, and MemSQL.
Speaker: Robin Li - Director of Data Engineering, Tapjoy and Yohan Chin - VP Data Science, Tapjoy
This document summarizes performance testing of an HP DL980 database server and DL380 ION Data Accelerator storage system. It describes the hardware configuration, storage pool setup, initiator and OS tuning, Oracle configuration, and results of fio and Oracle benchmark tests. Sequential read/write throughput exceeded 1GB/s and random IOPS exceeded 300,000 with sub-millisecond latency. Oracle testing showed significant performance gains over direct-attached storage.
Huawei SAPPHIRE presentation on KunLun 32-socket server (Mike Nelson)
Huawei SAPPHIRE 2016 presentation at SUSE Mini-theatre. An introduction to the first 32-socket x86 server for Mission-Critical computing. Speaker: Francis Lam, Huawei
This document introduces MemSQL 4, an in-memory distributed relational database. MemSQL provides real-time transactional processing and analytics. Key features of MemSQL 4 include a disk-based column store for analytics and data retention, cross data center replication, multi-statement transactions, and a new optimizer. MemSQL is used by enterprises for applications like real-time analytics, risk management, personalization, and infrastructure consolidation.
This document discusses in-memory databases and MemSQL's architecture. It begins by defining in-memory databases as databases that store data primarily in main memory rather than on disk. While they can spill to disk, the goal is to keep as much data as possible in fast memory. It then discusses MemSQL's rowstore and columnstore architectures, including how they implement concurrency control, crash recovery, and durability while maximizing memory performance.
This document discusses the growing popularity and capabilities of the Apache Spark platform for large-scale data analytics. It notes that Spark has over 40 committers, 1000 contributors, and is being used in 179 projects. The document highlights key features of Spark like its ease of use, performance (10-100x faster than MapReduce), flexibility, and ability to handle both batch and real-time processing. It also provides examples of how Spark can help businesses by enabling more complex analytics like predictive modeling, enabling smarter predictions, and allowing insights from real-time data. The document emphasizes that Spark advocates should focus on illustrating tangible business benefits over technical features when discussing Spark with higher-level business stakeholders.
MemSQL - The Real-time Analytics Platform (SingleStore)
MemSQL is the leader in real-time Big Data analytics, empowering organizations to make data-driven decisions, better engage customers, and gain a competitive advantage. The in-memory distributed database at the heart of MemSQL’s real-time analytics platform is proven in production environments across hundreds of nodes in the most high-velocity Big Data environments in the world.
In-Memory Database System Built for Speed and Scale (SingleStore)
MemSQL is an in-memory database system built for speed and scale. It uses lock-free skip lists for fast indexing and code generation to optimize query execution. MemSQL supports hybrid transactional/analytical processing workloads by allowing analytics to run over concurrently changing data. It delivers high throughput and low latency for both transactions and analytics by storing data entirely in memory. MemSQL is distributed, provides online operations, and uses various techniques like durability, replication, and clustering to ensure reliability and scalability.
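The skip-list index mentioned above can be sketched in miniature. This is a hedged, single-threaded illustration of the data structure only — the production version is lock-free and far more involved — and all names and parameters here are invented for the example.

```python
import random

# Illustrative single-threaded skip list: an ordered index in which each
# node carries a tower of forward pointers. Higher levels skip over more
# keys, giving expected O(log n) search without tree rebalancing.
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level

class SkipList:
    MAX_LEVEL = 8

    def __init__(self, seed=0):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel head node
        self.rng = random.Random(seed)

    def _random_level(self):
        # Coin-flip tower height: each extra level with probability 1/2.
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        # Record, per level, the last node with a key smaller than `key`.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        new = Node(key, self._random_level())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new

    def contains(self, key):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
        node = node.forward[0]
        return node is not None and node.key == key

index = SkipList()
for k in [42, 7, 19, 3, 88]:
    index.insert(k)
print(index.contains(19), index.contains(20))  # True False
```

The lock-free variant replaces the pointer updates above with atomic compare-and-swap operations so readers never block, which is what makes the structure attractive for concurrent in-memory indexing.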
Elevating customer analytics - how to gain a 720 degree view of your customer (Actian Corporation)
Big data creates significant opportunities for marketers. Using big data analytics tools, marketers can improve decision making, deliver better value for their marketing spend, create truly personalized customer experiences, and understand their audience at the level of each individual consumer.
For 30 years the central fact of database performance was the gigantic difference in the time it takes to access a random piece of data in RAM versus on a hard drive. It’s now feasible to skip all that heartache by placing your data entirely in RAM. It’s not as simple as that, of course. You can’t just take a btree, mmap it, and call it a day. There are a lot of implications to a truly memory-native design that have yet to be unwound.
These two trends are producing an entirely new way to think about, design, and build applications. So let’s talk about how we got here, how we’re doing, and hints about where the future will take us.
This document introduces MemSQL Pipelines, which allow for exactly-once data ingestion semantics when streaming data from Kafka into MemSQL. MemSQL Pipelines provide a native way to extract, transform, and load external data into MemSQL tables. They offer a scalable and highly performant ETL process across a distributed cluster. The document explains streaming semantics like at least once and exactly once delivery, and how MemSQL Pipelines coordinate with Kafka to enable exactly-once ingestion through offset tracking. It presents the architecture of MemSQL Pipelines and demonstrates their use through a live demo.
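The exactly-once idea described above — advancing the stored Kafka offset in the same transaction that writes the data — can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for the database; the table and column names are invented, and real pipelines handle many partitions and failures concurrently.

```python
import sqlite3

# Sketch of exactly-once ingestion via transactional offset tracking:
# the consumed offset commits atomically with the rows it produced, so
# after a crash the loader resumes from the last committed offset and
# redelivered messages are recognized and skipped.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (msg_offset INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE pipeline_offsets (part INTEGER PRIMARY KEY, next_offset INTEGER);
    INSERT INTO pipeline_offsets VALUES (0, 0);
""")

def ingest_batch(conn, part, messages):
    """messages: list of (offset, payload) pairs from the broker."""
    (next_offset,) = conn.execute(
        "SELECT next_offset FROM pipeline_offsets WHERE part = ?", (part,)
    ).fetchone()
    with conn:  # one transaction: data and offset advance commit together
        for off, payload in messages:
            if off < next_offset:
                continue  # already ingested in an earlier batch: skip duplicate
            conn.execute("INSERT INTO events VALUES (?, ?)", (off, payload))
            next_offset = off + 1
        conn.execute(
            "UPDATE pipeline_offsets SET next_offset = ? WHERE part = ?",
            (next_offset, part),
        )

ingest_batch(conn, 0, [(0, "a"), (1, "b")])
ingest_batch(conn, 0, [(1, "b"), (2, "c")])  # redelivery of offset 1 is ignored
```

Because the offset and the data live in the same transactional store, "at least once" delivery from the broker is downgraded to "exactly once" in the table: a batch either fully lands with its offset advance, or neither does.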
MemSQL is an in-memory relational database that provides horizontal scaling and distributed processing. It uses a shared-nothing architecture with independent database instances ("memsqld") that can process queries in parallel. MemSQL stores data either in memory or SSDs for fast performance of up to millions of queries per second. It also provides features for high availability, transactions, logging, and analytics on large datasets.
Journey to the Real-Time Analytics in Extreme Growth (SingleStore)
The document summarizes AppsFlyer's journey to implement a real-time analytics solution to handle their extreme growth and increasing data volumes. They were previously using TokuDB, but it was failing weekly and not scalable. They tried Druid, but it did not meet their requirements. They then implemented MemSQL, an in-memory database, which provided lower query latency, recoverability, and the ability to scale to handle 30x more data while reducing costs. Their current architecture uses Kafka to ingest data, MemSQL clusters for real-time queries, and a daily batch process to a columnstore for history.
In-Memory Database Performance on AWS M4 Instances (SingleStore)
This document summarizes a workshop agenda on MemSQL, an in-memory distributed SQL database. The agenda covers an introduction to MemSQL as a company and software, a discussion of current data challenges, and a demonstration of MemSQL's architecture, features like transactions and high availability, system requirements, licensing, and a speed test. Hands-on exercises are also included to showcase MemSQL's capabilities.
Virtual SAN hardware guidance & best practices (solarisyougood)
This document provides guidance on building and designing Virtual SAN hardware solutions. It discusses considerations for components like boot devices, flash-based devices, and capacity sizing. It also provides an overview of Virtual SAN certified hardware platforms and best practices for designing a balanced and fault-tolerant configuration.
Lambda at Weather Scale by Robbie Strickland (Spark Summit)
This document discusses The Weather Company's use of Cassandra and data analytics. Some key points:
- TWC collects ~30 billion API requests and ~360 PB of data daily from 120 million mobile users.
- Early attempts involved batch loading large datasets into Cassandra, which was slow and expensive. Streaming data via Kafka and REST services was also unnecessary.
- The improved architecture uses Cassandra for streaming data with individual tables for each event type. All other data is stored in S3. Amazon SQS replaces Kafka for reliable streaming ingestion.
- Data exploration is critical and is now done in minutes using tools like Zeppelin, rather than over a month as before.
- SAP provides enterprise applications and platforms used by many large companies worldwide. It is looking to better leverage big data from sources like IoT sensors and customer transactions by integrating it with its core enterprise applications.
- SAP introduced SAP HANA Vora, an in-memory query engine that extends Apache Spark to allow enterprises to enrich analytics by connecting data in HANA databases to big data sources in HDFS.
- Case studies highlighted how HANA Vora helps industries like utilities and airlines optimize operations by interactively querying sensor and transaction data stored across HANA and HDFS.
This document summarizes a webinar about using Informatica Cloud to load big data into AWS services like Amazon Redshift for analytics. It discusses how Informatica Cloud can help consolidate and analyze customer data from multiple sources for a company called UBM to improve customer insights. The webinar also provides an example of how UBM used Informatica Cloud and Redshift to better understand customer behaviors and identify potential event attendees through analytics.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... (Precisely)
This document discusses engineering machine learning data pipelines and addresses five big challenges: 1) scattered and difficult to access data, 2) data cleansing at scale, 3) entity resolution, 4) tracking data lineage, and 5) ongoing real-time changed data capture and streaming. It presents DMX Change Data Capture as a solution to capture changes from various data sources and replicate them in real-time to targets like Kafka, HDFS, databases and data lakes to feed machine learning models. Case studies demonstrate how DMX-h has helped customers like a global hotel chain and insurance and healthcare companies build scalable data pipelines.
AWS Webcast - Sales Productivity Solutions with MicroStrategy and Redshift (Amazon Web Services)
Sales Force Automation (SFA) and Customer Relationship Management (CRM) tools, such as Salesforce.com and Microsoft Dynamics CRM, are ubiquitous tools that provide all of the transactional capabilities required to manage a company's sales pipeline. SFA and CRM data alone, however, is limited, so combining it with information from other sources enables you to create unique and powerful insights. When combined with product and financial data, for example, you gain visibility into relationships between geographies, sales reps, product performance, and revenue to ultimately optimize profits. Layer on advanced analytics to make predictions about future product sales based on seasonality and other market conditions. To unleash the full power of the CRM and dramatically increase operational performance and top-line revenue, companies are leveraging advanced analytics and data visualization to deliver new insights to the entire sales organization. Moreover, delivering these sales enablement productivity solutions on mobile devices ensures strong adoption across every sales team. Join us in this webinar to learn how to use MicroStrategy together with Amazon Redshift to build mobile sales productivity solutions for your business.
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Data Con LA 2018 - Populating your Enterprise Data Hub for Next Gen Analytics... (Data Con LA)
Syncsort's data integration and data quality solutions on Hadoop can help accelerate the process of populating your Enterprise Data Hub with data from multiple disparate data sources like legacy systems, databases, ERPs, CRMs, etc. Standardizing and cleansing the data before it is ingested into the data lake will dramatically increase the analytics value proposition.
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ... (Deepak Chandramouli)
PayPal Data Lake Journey | 2017-Oct | San Diego | Teradata Edge of Next
Gimel [http://www.gimel.io] is a Big Data Processing Library, open sourced by PayPal.
https://www.youtube.com/watch?v=52PdNno_9cU&t=3s
Gimel empowers analysts, scientists, data engineers alike to access a variety of Big Data / Traditional Data Stores - with just SQL or a single line of code (Unified Data API).
This is possible via the Catalog of Technical properties abstracted from users, along with a rich collection of Data Store Connectors available in Gimel Library.
A Catalog provider can be Hive or User Supplied (runtime) or UDC.
In addition, PayPal recently open sourced UDC [Unified Data Catalog], which can host and serve the Technical Metadata of the Data Stores & Objects. Visit http://www.unifieddatacatalog.io to experience it firsthand.
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee... (HostedbyConfluent)
Converting production databases into live data streams for Apache Kafka can be labor intensive and costly. As Kafka architectures grow, complexity also rises as data teams begin to configure clusters for redundancy, partitions for performance, and consumer groups for correlated analytics processing. In this breakout session, you’ll hear data streaming success stories from Generali and Skechers that leverage Qlik Data Integration and Confluent. You’ll discover how Qlik’s data integration platform lets organizations automatically produce real-time transaction streams into Kafka, Confluent Platform, or Confluent Cloud, deliver faster business insights from data, and enable streaming analytics as well as streaming ingestion for modern analytics. Learn how these customers use Qlik and Confluent to: - Turn databases into live data feeds - Simplify and automate the real-time data streaming process - Accelerate data delivery to enable real-time analytics Learn how Skechers and Generali breathe new life into data in the cloud and stay ahead of changing demands, while lowering over-reliance on resources, production time, and costs.
The document discusses recommendations for Cummins' future data warehousing architecture and strategy. It recommends that Cummins:
1) Move certain databases from Oracle to Teradata's Active Data Warehouse private cloud to improve performance and scalability.
2) Implement Hadoop-as-a-Service using Google Compute Engine and MapR to handle big data and provide an enterprise data hub.
3) Adopt Cisco's Composite Data Virtualization Platform to provide a unified logical view of all company data from traditional and big data sources.
4) Add Tableau and Spotfire to the existing BI tools for advanced analytics and visualization.
5) Acquire IBM InfoSphere Streams to enable real-time business analytics.
IBM's Big Data platform provides tools for managing and analyzing large volumes of structured, unstructured, and streaming data. It includes Hadoop for storage and processing, InfoSphere Streams for real-time streaming analytics, InfoSphere BigInsights for analytics on data at rest, and PureData System for Analytics (formerly Netezza) for high performance data warehousing. The platform enables businesses to gain insights from all available data to capitalize on information resources and make data-driven decisions.
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod... (Hortonworks)
Many enterprises are turning to Apache Hadoop to enable Big Data Analytics and reduce the costs of traditional data warehousing. Yet, it is hard to succeed when 80% of the time is spent on moving data and only 20% on using it. It’s time to swap the 80/20! The Big Data experts at Attunity and Hortonworks have a solution for accelerating data movement into and out of Hadoop that enables faster time-to-value for Big Data projects and a more complete and trusted view of your business. Join us to learn how this solution can work for you.
IBM's Big Data platform provides tools for managing and analyzing large volumes of data from various sources. It allows users to cost effectively store and process structured, unstructured, and streaming data. The platform includes products like Hadoop for storage, MapReduce for processing large datasets, and InfoSphere Streams for analyzing real-time streaming data. Business users can start with critical needs and expand their use of big data over time by leveraging different products within the IBM Big Data platform.
The document discusses challenges with traditional data warehousing and analytics including high upfront costs, difficulty managing infrastructure, and inability to scale easily. It introduces Amazon Web Services (AWS) and Amazon Redshift as a solution, allowing for easy setup of data warehousing and analytics in the cloud at low costs without large upfront investments. AWS services like Amazon Redshift provide flexible, scalable infrastructure that is easier to manage than traditional on-premise systems and enables organizations to more effectively analyze large amounts of data.
Achieving Business Value by Fusing Hadoop and Corporate Data (Inside Analysis)
The Briefing Room with Richard Hackathorn and Teradata
Live Webcast March 25, 2015
Watch the Archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=e7254708146d056339a0974f097f569b2
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful analytic solutions require a fusion of all relevant data, big and small, which has proven challenging for many companies. By allowing business analysts to quickly access data wherever it rests, success factors shift to focus on three key aspects: 1) business objectives, 2) organizational workflow, and 3) data placement.
Register for this Special Edition of The Briefing Room to hear veteran Analyst Richard Hackathorn as he provides details from his recent research report focused on success stories using Teradata QueryGrid. Examples of use cases described will include:
Joining sensor data in Hadoop with data warehouse labor schedules in seconds
How bridging corporate cultures and systems creates new business opportunities
The 360 view of customer journeys using weblogs in Hadoop via BI tools
Putting the data where you want and querying it however you want
Virtualizing Hadoop data with Teradata QueryGrid
Visit InsideAnalysis.com for more information.
Slides: Success Stories for Data-to-Cloud (DATAVERSITY)
Companies are finding accessing data from a variety of sources can be labor-intensive and costly. Oftentimes these companies are looking to cloud solutions, but are then finding the traditional architecture brittle when trying to move data to the cloud, which can drain organizations of time and resources.
Join this webinar to hear several company success stories, the data-to-cloud issues they were encountering, and the steps these companies took to bring their cloud architecture to a successful, real-time analytic solution unlocking massive amounts of fresh enterprise-wide data on a continuous basis.
In addition, you will learn how to:
• Modernize the ETL process to one that’s fast, flexible, and scalable
• Supply users with up-to-date, accurate, trusted data
• Increase your time to value with data in the cloud
• Best practices on how to minimize resource overhead
Your Roadmap for An Enterprise Graph Strategy (Neo4j)
This document provides a roadmap for developing an enterprise graph strategy with the following key steps:
1) Identify a "graphy problem" that a graph database could help solve based on input from business stakeholders.
2) Design and build a proof-of-concept graph using a local Neo4j instance to model sample data and write example queries.
3) Pick and build a demo application to showcase the value of the graph to stakeholders based on the sample data and queries.
Relevance of time series databases & druid.io (Muniraju V)
This document discusses the relevance of time series databases for real-time solutions. It begins with introductions and discusses how business focus is shifting towards real-time opportunities and use cases that require processing data immediately. It then discusses challenges with using traditional databases for real-time solutions and outlines alternatives like time series databases. Specific examples of Druid.io are provided, including its features, the author's experience building a demo using it, and a sample reference architecture.
What's New in Syncsort's Trillium Line of Data Quality Software - TSS Enterpr... (Precisely)
Today, in the age of big data, data quality is more essential than ever. Whatever the size of your data – you need it to be clean, free of duplicates and ready for use.
View this customer education webinar on-demand where you will learn more about the latest improvements in the market-leading data quality solution – Syncsort’s TSS Enterprise, and how it can help you receive a quicker ROI from your Syncsort Trillium investment.
During this webinar, you will learn more about new TSS Enterprise 15.8 features such as:
• Performance improvements in Syncsort Trillium Discovery
• Syncsort’s Collibra integration for a stronger data governance capability
• Added support for Amazon EMR to Syncsort Trillium Quality for Big Data
• The NEW real-time data quality function
Don’t have TSS? View this webinar on-demand to see what you may be missing by not having market-leading data quality solutions. Whether you need to de-duplicate millions of records on Spark, want to fix data errors in real-time in your CRM or build geo-location and address verification into your web application – we’ve got what you’re looking for!
The document discusses optimizing a data warehouse by offloading some workloads and data to Hadoop. It identifies common challenges with data warehouses like slow transformations and queries. Hadoop can help by handling large-scale data processing, analytics, and long-term storage more cost effectively. The document provides examples of how customers benefited from offloading workloads to Hadoop. It then outlines a process for assessing an organization's data warehouse ecosystem, prioritizing workloads for migration, and developing an optimization plan.
Digital Business Transformation in the Streaming Era (Attunity)
Enterprises are rapidly adopting stream computing backbones, in-memory data stores, change data capture, and other low-latency approaches for end-to-end applications. As businesses modernize their data architectures over the next several years, they will begin to evolve toward all-streaming architectures. In this webcast, Wikibon, Attunity, and MemSQL will discuss how enterprise data professionals should migrate their legacy architectures in this direction. They will provide guidance for migrating data lakes, data warehouses, data governance, and transactional databases to support all-streaming architectures for complex cloud and edge applications. They will discuss how this new architecture will drive enterprise strategies for operationalizing artificial intelligence, mobile computing, the Internet of Things, and cloud-native microservices.
Link to the Wikibon report - wikibon.com/wikibons-2018-big-data-analytics-trends-forecast
Link to Attunity Streaming CDC Book Download - http://www.bit.ly/cdcbook
Link to MemSQL's Free Data Pipeline Book - http://go.memsql.com/oreilly-data-pipelines
How Kafka and Modern Databases Benefit Apps and Analytics (SingleStore)
This document provides an overview of how Kafka and modern databases like MemSQL can benefit applications and analytics. It discusses how businesses now require faster data access and intra-day processing to drive real-time decisions. Traditional database solutions struggle to meet these demands. MemSQL is presented as a solution that provides scalable SQL, fast ingestion of streaming data, and high concurrency to enable both transactions and analytics on large datasets. The document demonstrates how MemSQL distributes data and queries across nodes and allows horizontal scaling through its architecture.
The database market is large and filled with many solutions. In this talk, Seth Luersen from MemSQL will take a look at what is happening within AWS, the overall data landscape, and how customers can benefit from using MemSQL within the AWS ecosystem.
Building the Foundation for a Latency-Free Life (SingleStore)
The document discusses how MemSQL is able to process 1 trillion rows per second on 12 Intel servers running MemSQL. It demonstrates this throughput by running a query to count the number of trades for the top 10 most traded stocks from a dataset of over 115 billion rows of simulated NASDAQ trade data. The document argues that a latency-free operational and analytical data platform like MemSQL that can handle both high-volume operational workloads and complex queries is key to powering real-time analytics and decision making.
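The demo query above is, at its core, a grouped count with a top-10 cut. A toy stand-in makes the shape of the computation concrete; the symbols and counts here are invented, and the real demo runs the equivalent SQL aggregation over 115+ billion rows.

```python
from collections import Counter

# Toy version of the demo's aggregation: count trades per symbol and keep
# the ten most traded. In SQL this is GROUP BY symbol ORDER BY COUNT(*)
# DESC LIMIT 10 over the trades table; symbol names are illustrative.
trades = ["AAPL", "MSFT", "AAPL", "GOOG", "MSFT", "AAPL"]

top10 = Counter(trades).most_common(10)
print(top10)  # [('AAPL', 3), ('MSFT', 2), ('GOOG', 1)]
```

The "trillion rows per second" claim rests on this query being embarrassingly parallel: each partition counts its own rows and only the small per-symbol tallies are merged at the end.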
Converging Database Transactions and Analytics (SingleStore)
Delivered at the Gartner Data and Analytics 2018 show in Texas, this presentation discusses real-time applications and their impact on existing data infrastructures.
Building a Machine Learning Recommendation Engine in SQL (SingleStore)
This document discusses building machine learning recommendation engines using SQL. It begins with an overview of data and analytics trends including the convergence of operational and analytical databases. The rise of machine learning is then covered along with how databases are integrating machine learning capabilities. A live demo is presented using the Yelp dataset to build a recommendation engine directly in SQL, leveraging the database's extensibility, stored procedures, and user defined functions. The document argues that training can be done externally but operational scoring can and should be done directly in the database for real-time applications.
MemSQL 201: Advanced Tips and Tricks Webcast (SingleStore)
This document summarizes a webinar on advanced tips and tricks for MemSQL. It discusses the differences between rowstore and columnstore storage models and when each is best used. It also covers data ingestion using MemSQL Pipelines for real-time loading, data sharding and query tuning techniques like using reference tables. Additionally, it discusses monitoring memory usage, workload management using management views, and query optimization tools like analyzing and optimizing tables.
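The sharding mentioned above boils down to routing each row by a hash of its shard key. A hedged sketch of that routing logic, with an invented partition count and key names:

```python
import hashlib

# Hash-sharding sketch: a row lands on the partition given by a hash of
# its shard key. Lookups on that key touch a single partition, while
# queries without the key fan out to all of them. Partition count and
# keys are illustrative, not MemSQL's actual scheme.
PARTITIONS = 8

def shard_for(key):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % PARTITIONS

rows = ["user-1", "user-2", "user-3"]
placement = {r: shard_for(r) for r in rows}
print(placement)
```

This also explains the webinar's reference-table tip: small dimension tables are replicated to every partition instead of sharded, so joins against them never require cross-partition data movement.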
Mike Boyarski gave a presentation on MemSQL, an operational data warehouse that provides real-time analytics capabilities. He discussed challenges with traditional databases around slow data loading, lengthy query times, and low concurrency. MemSQL addresses these issues with fast data ingestion, low latency queries, and high scalability. It can ingest streaming data, run on a variety of platforms, and provides security, SQL support, and integration with common data tools. MemSQL was shown augmenting an existing IoT architecture to enable real-time analytics through fast data loading, consolidated data storage, and high query performance.
An Engineering Approach to Database Evaluations (SingleStore)
This talk will go over a methodical approach for making a decision, dig into interesting tradeoffs, and give tips about what things to look for under the hood and how to evaluate the tech behind the database.
Building a Fault Tolerant Distributed Architecture (SingleStore)
This talk will highlight some of the challenges to building a fault tolerant distributed architecture, and how MemSQL's architecture tackles these challenges.
Stream Processing with Pipelines and Stored Procedures (SingleStore)
This talk will discuss an upcoming feature in MemSQL 6.5 showing how advanced stream processing use cases can be tackled with a combination of stored procedures (new in 6.0) and MemSQL's pipelines feature.
The document describes Curriculum Associates' journey to develop a real-time application architecture to provide teachers and students with real-time feedback. They started with batch ETL to a data warehouse and migrated to an in-memory database. They added Kafka message queues to ingest real-time event data and integrated a data lake. Now their system uses MemSQL, Kafka, and a data lake to provide real-time and batch processed data to users.
Learn how to leverage MPP technology and distributed data to deliver high-volume transactional and analytical workloads, resulting in real-time dashboards on rapidly changing data using standard SQL tools. Demonstrations will include streaming structured and JSON data from Kafka messages through a micro-batch ETL process into the MemSQL database, where the data is then queried using standard SQL tools and visualized with Tableau.
This session will focus on image recognition, the techniques available, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how that can assist in large-scale, high-throughput, highly-parallel image recognition.
LIVE DEMO: Constructing and executing a real-time image recognition pipeline using Kafka and Spark.
Speaker: Neil Dahlke, MemSQL Senior Solutions Engineer
The document discusses real-time image recognition using Apache Spark. It describes how images are analyzed to extract histogram of oriented gradients (HOG) descriptors, which are stored as feature vectors in a MemSQL table. Similar images can then be identified by comparing feature vectors using dot products, enabling searches of millions of images per second. A demo is shown generating HOG descriptors from an image and storing them as a vector for fast similarity matching.
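The dot-product comparison described above is easy to sketch. This is a hedged illustration with made-up three-element vectors standing in for HOG descriptors (real descriptors have hundreds of dimensions); normalizing on insert makes the dot product equal to cosine similarity, so ranking is a single pass of multiplies and adds.

```python
import math

# Compare image feature vectors (stand-ins for HOG descriptors) with a
# dot product. Vectors are L2-normalized, so dot product == cosine
# similarity, and an identical image scores exactly 1.0.
def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = normalize([0.9, 0.1, 0.4])
catalog = {
    "img_a": normalize([0.9, 0.1, 0.4]),   # identical image
    "img_b": normalize([0.1, 0.9, 0.2]),   # unrelated image
}
ranked = sorted(catalog, key=lambda k: dot(query, catalog[k]), reverse=True)
print(ranked[0])  # img_a
```

Scanning millions of stored vectors per second then reduces to evaluating this dot product in bulk, which is exactly the kind of tight arithmetic loop a columnar, in-memory engine executes well.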
The State of the Data Warehouse in 2017 and Beyond (SingleStore)
The document provides an overview of the changing analytic environment and the evolution of the data warehouse. It discusses how new requirements like performance, usability, optimization, and ecosystem integration are driving the adoption of a real-time data warehouse approach. A real-time data warehouse is described as having low latency ingestion, in-memory and disk-optimized storage, and the ability to power both operational and machine learning applications. Examples are given of companies using a real-time data warehouse to enable real-time analytics and improve business processes.
How Database Convergence Impacts the Coming Decades of Data Management (SingleStore)
How Database Convergence Impacts the Coming Decades of Data Management by Nikita Shamgunov, CEO and co-founder of MemSQL.
Presented at NYC Database Month in October 2017. NYC Database Month is the largest database meetup in New York, featuring talks from leaders in the technology space. You can learn more at http://www.databasemonth.com.
Teaching Databases to Learn in the World of AI (SingleStore)
The document discusses how databases need to learn and adapt like artificial intelligence in order to power real-time applications, highlighting that databases must be simple, capable of real-time processing, and adaptable by learning behaviors and making autonomous decisions. It also promotes MemSQL's vision of teaching databases to learn by consolidating infrastructure, enabling real-time queries on fresh data, and allowing both transactions and analytics workloads.
Gartner Catalyst 2017: The Data Warehouse Blueprint for ML, AI, and Hybrid Cloud (SingleStore)
This document discusses a data warehouse blueprint for machine learning, artificial intelligence, and hybrid cloud. It provides a live demonstration of k-means clustering in SQL with MemSQL. The demonstration loads YouTube tag data, sets up k-means clustering functions using MemSQL extensibility, runs the k-means algorithm to train the data, and outputs insights into important tags and representative channels. It also briefly discusses MemSQL's capabilities for a real-time data warehouse and hybrid cloud deployments to support analytics, machine learning, and artificial intelligence workloads.
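The k-means algorithm the demo implements in SQL alternates two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal one-dimensional Python sketch (data, k, and iteration count are illustrative, not from the talk):

```python
import random

# Minimal k-means: the same assign / recompute loop the demo expresses
# with SQL aggregates and UDFs, on toy 1-D data with k=2.
def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick initial centroids
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2))
```

Both steps map naturally onto SQL: assignment is a join against the centroid table picking the minimum distance, and the update is a GROUP BY with AVG, which is why the demo can run entirely inside the database.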
Gartner Catalyst 2017: Image Recognition on Streaming DataSingleStore
This document discusses using MemSQL to perform real-time image recognition on streaming data. Key points include:
- Feature vectors extracted from images using models like TensorFlow can be stored in MemSQL tables for analysis.
- MemSQL allows querying these feature vectors to find similar images based on cosine similarity calculations.
- This enables applications like detecting duplicate or illegal images in real-time streams.
James Burkhart explains how Uber supports millions of analytical queries daily across real-time data with Apollo. James covers the architectural decisions and lessons learned building an exactly-once ingest pipeline storing raw events across in-memory row storage and on-disk columnar storage and a custom metalanguage and query layer leveraging partial OLAP result set caching and query canonicalization. Putting all the pieces together provides thousands of Uber employees with subsecond p95 latency analytical queries spanning hundreds of millions of recent events.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Getting It Right Exactly Once: Principles for Streaming Architectures
1. Getting It Right Exactly Once: Principles for Streaming Architectures
Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
September 2016 | Strata+Hadoop World, NY
2. Getting Started
I'm Darryl Smith
• Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
Agenda
• Real-Time And The Need For Streaming
• Adding Real-Time And Streaming To The Data Lake
• Results, Plans, Lessons Learned
• Demonstration
3. Trickle, Flood, or Torrent…
Streaming is about continuous data motion, more than speed or volume
5. The Enterprise Reality
Batch > Real-Time > Streaming
Enterprise Opportunities
Immediate Business Advantage
• Website and Mobile Application Logs
• Internet of Things Sensors
6. The Enterprise Streaming Play
Moving from batch to real-time streams avoids surges, normalizes compute, and drives value
8. Analytics Vision
Drive DellEMC towards a Predictive Enterprise via intelligent data, driving agility and increasing revenue and productivity, resulting in a competitive advantage
9. Becoming An Analytical Enterprise
Need to use new data for competitive advantage
• Volume, Variety and Velocity
Leverage near real-time and streaming data sets to optimize predictions
• Make faster, better decisions
Cost-effectively scale to improve query and load performance
Put the data in the hands of the business
Pillars: drive competitive advantage | cost-effectively scale | data access by business | near real-time analytics
10. Scoping The Business Objectives
Problem Statement: Teams do not have access to maintenance renewal quotes in the timeframes or at the quality they need for Tech Refresh and Renewal sales.
Desired Outcome: Implement a cost-effective, real-time solution that improves productivity and gives confidence to produce desired outcomes efficiently.
11. Business Drivers
To realize this vision: implement the CALM solution in phases and optimize business processes.
Current reality → Vision for the future:
• High touch tactical execution → Low touch self service
• Date driven processes → Business value driven processes
• Inefficiencies and lost productivity → Increased productivity
• Siloed data / limited views → Single view of data / data scoring
• Variable data quality → Data quality and confidence
12. The Need for "CALM"
Customer Asset Lifecycle Management
For enterprise sales, who need accurate and timely customer information, CALM is a real-time application providing up-to-the-moment customer 360 dashboards.
Dashboard components: Install Base, Pricing, Device Config, Contacts, Contracts, Analytics, Component Data, Offers, Scorecard
13. Data Lake Architecture
[Architecture diagram] The data platform runs on VMware vCloud Suite. Execution and process tiers span Spring XD, Greenplum DB, Pivotal HD, GemFire, Cassandra, PostgreSQL, and MemSQL, with HDFS on Isilon and Hadoop on ScaleIO atop VCE Vblock/VxRack, XtremIO, and Data Domain. Ingestion (batch, micro-batch, real-time) and data governance (Collibra, Attivio, Apache Ranger) cut across the platform, feeding an analytics toolbox. Sources include structured applications (CRM, ERP, PLM) and unstructured feeds (network, web, sensor, supplier, social media, market data).
14. Business Data Lake Offerings
Data Ingestion
• Small to big data (high-throughput)
• Structured and unstructured data from any source
• Streams and batches
• Secure, multi-tenant, configurable framework
Real-Time Analytics
• Tap into streams for in-memory analytics
• Real-time data insights and decisions
Services
• Data ingestion to data lake
• Data lake APIs
• Data alerting
16. Seeking A Fast Database
A complement to the business data lake
17. HammerDB Platform Benchmarks
HammerDB workload testing was done following EMC's Oracle and SQL Server DBA teams' standard practices.
Definition of workload, a mix of 5 transactions as follows:
• New order: receive a new order from a customer: 45%
• Payment: update the customer balance to record a payment: 43%
• Delivery: deliver orders asynchronously: 4%
• Order status: retrieve the status of a customer's most recent order: 4%
• Stock level: return the status of the warehouse's inventory: 4%
Testing scenario:
• 100 warehouses, 8 vUsers. Database creation and initial data loading.
• Timed testing: 20 minutes per testing session.
• Number of virtual users scaled from 1 to 44 across testing sessions.
No changes were made to system or database configuration while running the tests.
18. HammerDB Workload Testing
Each test ran on 16 vCPU x 32 GB RAM:
• RedHat 6.4 with Oracle 11g R2
• Windows Core 2012 R2 with SQL Server 2012 Enterprise Edition
• RedHat 6.4 with PostgreSQL 9.3.3
20. PostgreSQL vs In-Memory DB
We picked the 5 top queries run by different business functions. Presented here are 3 queries whose response times did not meet the SLA.

Query               PostgreSQL     MemSQL
Opportunity (5K)    5 seconds      200 ms
Sales Order (170K)  1-1.5 minutes  6 seconds
Territory (60K)     60 seconds     5 seconds
21. Business Data Lake – Ingestion to Fulfillment
[Data flow diagram] Raw data enters through the ingest manager (Spring XD, Spark, Sqoop) under the data governor and lands as raw data in Hadoop. Processed, summary, and analytical data flow into the Greenplum database for predictive/prescriptive analytics. A real-time tap feeds the execution tier (GemFire, Cassandra, MemSQL, PostgreSQL), which serves consumers.
22. Here Are The Data Flows We Built
• Low Velocity
• Batch
• Real-Time
23. Data Flow Patterns – Low Velocity
[Diagram] Raw data is ingested one time into the analytical (batch) tier (Greenplum database, Pivotal HD), then flows through a data service into the presentation (speed/serving) tier (PostgreSQL, MemSQL, Cassandra, GemFire), which the application queries over JDBC.
25. Data Flow Patterns – Real Time
[Diagram] An initial load populates the analytical (batch) tier (Greenplum database, Pivotal HD); real-time ingestion then flows through a data service directly into the presentation (speed/serving) tier (PostgreSQL, MemSQL, Cassandra, GemFire), which the application queries over JDBC.
26. Nothing Closer To Real Time Than Streaming
Let's look at the leading edge: Apache Kafka
Messaging Semantics
• At most once
• At least once
• Exactly once
30. Understanding Streaming Semantics
• At most once: message pulled once; may or may not be received; no duplicates; possible missing data.
• At least once: message pulled one or more times and processed each time; receipt guaranteed; likely duplicates; no missing data.
• Exactly once: message pulled one or more times but processed once; receipt guaranteed; no duplicates; no missing data.
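The practical difference between the last two guarantees is what the consumer does on redelivery. A toy sketch (all names hypothetical; real systems such as MemSQL pipelines track message offsets transactionally rather than keeping an in-memory seen-set):

```python
# Sketch: how redelivered messages affect at-least-once vs exactly-once.
# Each delivery is (message_id, payload); ids identify redeliveries.

def process_at_least_once(deliveries, sink):
    """Every delivery is processed, so redeliveries become duplicates."""
    for msg_id, payload in deliveries:
        sink.append(payload)

def process_exactly_once(deliveries, sink, seen=None):
    """Redeliveries are detected by id and skipped: processed once."""
    seen = set() if seen is None else seen
    for msg_id, payload in deliveries:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        sink.append(payload)

# Message 1 is redelivered after a simulated consumer restart.
deliveries = [(1, 10), (2, 3), (1, 10), (3, 5)]

at_least = []
process_at_least_once(deliveries, at_least)
print(sum(at_least))   # 28: the duplicate of message 1 skews the total

exactly = []
process_exactly_once(deliveries, exactly)
print(sum(exactly))    # 18: each message counted exactly once
```

This is why the demo below checks a sum: any dropped or duplicated message moves the total away from the known answer.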
31. Rendering In Real Time
Picking the right business intelligence layer
• Tableau
• Custom Application (CF, D3, Docker)
• Additional Third Party Solutions
33. Business Benefits
• Data querying: down from 4 hours per quarter to less than 1 minute per year
• Simplified provisioning: reduced number of tables/reports required
• Data governance: provides one version of the truth
• Time to market: reduced number of tables/reports required
• Tool agnostic: business logic in the DB, not the tool, provides increased flexibility
34. Use Case: Customer Account Profile
Streamlined analytics environment to gain a holistic customer view
[Diagram] Data ingestion feeds the EMC data lake, BDL services, and 23 business-managed data workspaces from sources including Service Request, Contracts, Installed Base, Bookings, Billings, and Prof Services.
35. Customer Asset Lifecycle Management Platform Roadmap
[Roadmap diagram spanning Aug 2015, Oct 2015, 2016, and TBD]
• Phase 1: foundational capabilities / discovery
• Phase 2: scale platform / automate
• Future phases: global standard tool, integrations, advanced analytics
Supporting tracks include BDL platform enablement, in-memory capabilities (POC), an integrated then scalable platform, BAaaS/Tableau, and onboarding of GBS Renewals, Inside Sales, and additional business groups. "We are here" marks the current position.
36. Business Data Lake Plans
Data Services Roadmap
• Security: planned integration into a custom BDL security API for managing Role-Based Access Control (RBAC) to the underlying data
37. Lessons Learned – Key Takeaways
• Educate: educate the business; use examples of business impact.
• Assess: assess in-house big data skills; ensure a plan to support the organization for 3-5 years.
• Infrastructure: choose the best possible infrastructure; make sure your big data technology platform can evolve.
• Journey: remember it is a journey; look for small wins as well as big wins.
38. Lessons Learned: Analytics and Data
Sourcing the right skills, working with a different philosophy, and some new tools will help you meet your analytical goals.
• Transform your people: data science in the organization, IT, or both? Help business units take initiative.
• Change your processes: a new philosophy for running analytics projects; how and when to share data.
• Adapt your technology: steadily refine toolsets based on needed analysis; identify the infrastructure layers.
40. Demo Agenda
Showcase exactly-once semantics from Kafka
1: Data set of 200,000 transactions summing to zero
2: CREATE TABLE and CREATE PIPELINE
3: Push to Kafka and confirm exactly-once
4: Validate resiliency and confirm exactly-once
41. Step 1: Data Source
Start with a data set of 200,000 transactions representing money/goods that sum to zero.
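A data set with this property is easy to build: emit each random amount twice with opposite signs, so any lost or duplicated record almost certainly pushes the sum away from zero. A hypothetical generator (the demo's actual script isn't shown here):

```python
import random

def make_transactions(n=200_000, seed=42):
    """Generate n transactions that sum to exactly zero by emitting
    each random amount as a matched +/- pair (n must be even)."""
    assert n % 2 == 0
    rng = random.Random(seed)
    txns = []
    for i in range(n // 2):
        amount = rng.randint(1, 10_000)   # e.g. cents
        txns.append((2 * i, amount))      # (transaction id, amount)
        txns.append((2 * i + 1, -amount))
    return txns

txns = make_transactions()
print(len(txns))                    # 200000
print(sum(a for _, a in txns))      # 0
```

The zero sum and the known row count are exactly the two invariants the later validation queries check.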
45. Step 3: Push to Kafka
Push the data set to Kafka, then validate exactly-once delivery by querying MemSQL:
• show tables;
• show pipelines;
• select sum(amount) from transactions; (should be 0 in the demo)
• select count(*) from transactions; (should be 200,000 in the demo)
47. Step 4: Resiliency
Induce failures to show resiliency during exactly-once workflows:
a. randomly_fail_batches.py
b. restart Kafka and show the error count
c. continue and validate exactly-once semantics
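The reason induced failures don't break the invariants is that a pipeline commits each batch's data and its source offset in one transaction: a failed batch is retried, but a committed batch is never reapplied. A toy model of that idea (all names hypothetical, not MemSQL internals):

```python
import random

def ingest_with_retries(batches, fail_rate=0.3, seed=7):
    """Load batches transactionally: a batch either commits fully
    (its rows plus its offset) or not at all, and uncommitted batches
    are retried. Committed offsets prevent double-application."""
    rng = random.Random(seed)
    table, committed_offsets, errors = [], set(), 0
    for offset, batch in enumerate(batches):
        while offset not in committed_offsets:
            if rng.random() < fail_rate:
                errors += 1             # simulated batch failure; retry
                continue
            table.extend(batch)         # rows and offset commit together
            committed_offsets.add(offset)
    return table, errors

# 1000 batches of matched +/- pairs, like the demo's zero-sum data set.
batches = [[(i, 1), (i, -1)] for i in range(1000)]
table, errors = ingest_with_retries(batches)
print(len(table), sum(a for _, a in table))   # 2000 0, despite failures
```

Even with a 30% simulated failure rate, the final row count and zero sum hold, which is the exactly-once result the demo validates after restarting Kafka.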