Learn how to leverage MPP technology and distributed data to deliver high-volume transactional and analytical workloads that result in real-time dashboards on rapidly changing data using standard SQL tools. Demonstrations will include streaming structured and JSON data from Kafka messages through a micro-batch ETL process into the MemSQL database, where the data is then queried using standard SQL tools and visualized with Tableau.
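As an illustration of the micro-batch step described above, here is a minimal, hypothetical Python sketch (the `ticks` table and its columns are invented for the example). It flattens one micro-batch of JSON Kafka message values into a single parameterized multi-row INSERT; because MemSQL speaks the MySQL wire protocol, a standard MySQL client library could then execute the statement.

```python
import json

def build_insert(table, messages):
    """Flatten a micro-batch of JSON Kafka messages into one
    parameterized multi-row INSERT statement plus its parameter list."""
    rows = [json.loads(m) for m in messages]
    cols = sorted(rows[0])  # assume a uniform schema within a batch
    placeholders = "(" + ", ".join(["%s"] * len(cols)) + ")"
    sql = (f"INSERT INTO {table} ({', '.join(cols)}) VALUES "
           + ", ".join([placeholders] * len(rows)))
    params = [row[c] for row in rows for c in cols]
    return sql, params

# One simulated micro-batch of Kafka message values:
batch = ['{"ts": 1, "symbol": "ABC", "price": 10.5}',
         '{"ts": 2, "symbol": "XYZ", "price": 99.0}']
sql, params = build_insert("ticks", batch)
# cursor.execute(sql, params)  # with a MySQL-protocol client pointed at MemSQL
```

Batching many rows into one INSERT is what makes a micro-batch loop keep up with a high-volume Kafka topic; the actual consumer and database connection are omitted here.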
This session will focus on image recognition, the techniques available, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how that can assist in large-scale, high-throughput, highly-parallel image recognition.
LIVE DEMO: Constructing and executing a real-time image recognition pipeline using Kafka and Spark.
Speaker: Neil Dahlke, MemSQL Senior Solutions Engineer
How Database Convergence Impacts the Coming Decades of Data Management (SingleStore)
How Database Convergence Impacts the Coming Decades of Data Management by Nikita Shamgunov, CEO and co-founder of MemSQL.
Presented at NYC Database Month in October 2017. NYC Database Month is the largest database meetup in New York, featuring talks from leaders in the technology space. You can learn more at http://www.databasemonth.com.
An Engineering Approach to Database Evaluations (SingleStore)
This talk will go over a methodical approach for making a decision, dig into interesting tradeoffs, and give tips about what things to look for under the hood and how to evaluate the tech behind the database.
Why would I store my data in more than one database? (Kurtosys Systems)
Moving away from a single system solution
When you think of data, the chances are that you will think of documents and spreadsheets, or images and emails. From time to time, technologies have come along that increase the amount of data an individual stores, but historically, if you owned a file you also had physical possession of the bits and bytes that formed it.
MemSQL 201: Advanced Tips and Tricks WebcastSingleStore
Topics discussed include the differences between the columnstore and rowstore engines, data ingestion, data sharding, query tuning, and memory and workload management.
Watch the replay at https://memsql.wistia.com/medias/4siccvlorm
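The columnstore-versus-rowstore difference mentioned above can be illustrated with a toy sketch (this is a conceptual model, not MemSQL's actual storage format): an analytical aggregate over one column only needs to touch that column when the data is laid out column-wise, whereas a row-wise layout materializes every row.

```python
# Row-wise layout: every full row is visited even though the
# aggregate only needs the "price" column.
rows = [{"id": i, "price": float(i), "note": "x" * 50} for i in range(1000)]
row_scan = sum(r["price"] for r in rows)

# Columnar layout: the same aggregate reads one contiguous column.
cols = {"id": list(range(1000)),
        "price": [float(i) for i in range(1000)],
        "note": ["x" * 50] * 1000}
col_scan = sum(cols["price"])

assert row_scan == col_scan  # same answer; the columnar scan touches far less data
```

This is why analytical scans favor a columnstore while point reads and writes favor a rowstore.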
Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. With Cloud Spanner you enjoy all the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally, to hundreds or thousands of servers, so it can handle the highest of transactional workloads.
The database market is large and filled with many solutions. In this talk, Seth Luersen from MemSQL takes a look at what is happening within AWS, the overall data landscape, and how customers can benefit from using MemSQL within the AWS ecosystem.
Data Versioning and Reproducible ML with DVC and MLflow (Databricks)
Machine Learning development involves comparing models and storing the artifacts they produce. We often compare several algorithms to select the most efficient ones. We assess different hyper-parameters to fine-tune the model. Git helps us store multiple versions of our code. Additionally, we need to keep track of the datasets we are using. This is important not only for audit purposes but also for assessing the performance of models developed at a later time. Git is a standard code versioning tool in software development. It can be used to store your datasets, but it does not offer an optimal solution.
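The core idea behind tools like DVC can be sketched in a few lines (this is a loose, hypothetical imitation, not DVC's actual file format): pin each dataset version by its content hash, commit the small pointer to Git, and keep the large data file outside version control.

```python
import hashlib
import json
import os
import tempfile

def snapshot(path):
    """Pin a dataset version by content hash. Loosely mimics what a
    DVC pointer file does: the small hash goes into Git, the large
    data file stays outside version control."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"path": os.path.basename(path), "sha256": digest}

with tempfile.TemporaryDirectory() as d:
    data_file = os.path.join(d, "train.csv")
    with open(data_file, "wb") as f:
        f.write(b"x,y\n1,2\n")
    pointer = snapshot(data_file)  # commit this small dict, not the data
    print(json.dumps(pointer))
```

Checking out an old commit then tells you exactly which dataset bytes produced a given model, which is the reproducibility property the talk is about.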
Bringing OLAP Fully Online: Analyze Changing Datasets in MemSQL and Spark wi... (SingleStore)
As the world moves from batch to online data processing, real-time data pipelines will supersede siloed data warehouse and transaction processing systems as core infrastructure.
While many analytics solutions tout query execution speed, this is only half of the equation.
For real-time workloads, stale data renders query speed irrelevant when results and insights are out of date.
Beyond just “online queries,” real-time enterprises need “online datasets” that continuously update and make data accessible across the organization.
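The "online dataset" idea above can be sketched with a toy incremental aggregate (a conceptual illustration, not MemSQL code): each arriving event updates the dashboard value in O(1) instead of re-scanning history, which is what keeps the dataset continuously up to date.

```python
# A dashboard metric maintained as an "online dataset": each new event
# updates the running aggregate in O(1) instead of re-scanning history.
class RunningAvg:
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def add(self, x):
        self.n += 1
        self.total += x

    @property
    def value(self):
        return self.total / self.n

avg = RunningAvg()
for price in [10.0, 20.0, 30.0]:  # events arriving from a stream
    avg.add(price)
assert avg.value == 20.0
```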
This session will cover approaches to building real-time pipelines with MemSQL, Hadoop, and Spark. Topics will include:
Key industry trends and the move to real-time data pipelines
How MemSQL customer Novus built the premier financial portfolio management platform using MemSQL as a real-time data store and query engine.
Operationalizing Spark for Advanced Analytics
Demonstration of how Pinterest is using the MemSQL Spark Connector to derive real-time insights on interesting and meaningful user activity with MemSQL and Spark.
Introduction to the MemSQL Spark Connector
Strategies for integrating Spark and Hadoop with real-time systems for transaction processing and operational analytics.
Presenters include MemSQL CEO Eric Frenkiel, Novus CTO Robert Stepeck, and Pinterest Software Engineer Yu Yang.
In a world of web portals and push notifications, users have developed demanding expectations for a real-time experience. Continuous updates, a responsive interface, and short loading times have become the norm. Most business analysts and data scientists, whose workflows remain bound by legacy tools and complex data pipelines, lack this fast, simple user experience.
From a business perspective, latency and complexity impede revenue by preventing access to the right data at the right time. Businesses that recognize the value of access to real-time data now have options to meet stringent objectives. They understand that serving “always up to date” data for analysis requires converging transactions and analytics in a real-time system. This session will highlight these architectures and customer achievements.
Operationalizing Machine Learning Using GPU-accelerated, In-database Analytics (Kinetica)
Mate Radalj's presentation on how to operationalize machine learning using GPU-accelerated, in-database analytics, given at the Bay Area GPU-Accelerated Computing Meetup on October 19, 2017. Presentation includes use cases and links to demos.
Apache Cassandra Lunch #71: Creating a User Profile Using DataStax Astra and ... (Anant Corporation)
In Cassandra Lunch #71, we will discuss how DataStax Astra can be used as a back-end for a React client. We will demo a small application with a user profile.
Accompanying Blog: https://blog.anant.us/creating-a-user-profile-using-datastax-astra/
Accompanying YouTube: https://youtu.be/7n4PsYhGIfM
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
Architecture Best Practices to Master + Pitfalls to Avoid (Elasticsearch)
Scale with confidence. From deploying a small dev cluster for app search to managing a prod deployment of hundreds of nodes, our Elastic experts have seen it all and they’re sharing everything you need to know.
Dataflow - A Unified Model for Batch and Streaming Data Processing (DoiT International)
Batch and Streaming Data Processing and Visualize 300Tb in 5 Seconds meetup on April 18th, 2016 (http://www.meetup.com/Big-things-are-happening-here/events/229532500)
A Carrier-grade PaaS aims to bring the network and application together. That means an application can easily be deployed on multiple sites in different physical locations as if they were one big data centre. Unlike a regular cloud environment, where applications need to explicitly handle multi-zone deployments, with Carrier PaaS the application workload and availability are handled through a policy-driven approach. The policy describes the desired application SLA, and the Carrier PaaS maps the deployment of application resources onto the cloud nodes that best fit the latency, load, or availability requirements.
Migrating On-premises Workload to Azure SQL Database (Parikshit Savjani)
Azure SQL Database is a fully managed cloud database service with built-in intelligence, elastic scale, performance, reliability, and data protection that enables enterprises and ISVs to reduce their total cost of ownership and operational overhead. In this session, I will share real-world experience of successfully migrating existing SaaS applications and on-premises workloads for some of our tier 1 customers and ISV partners to the Azure SQL Database service. The session walks through planning, assessment, migration tools, and best practices drawn from the proven experience of migrating real-world applications to the Azure SQL Database service.
Building Identity Graphs over Heterogeneous Data (Databricks)
In today’s world, customers and service providers (e.g., social networks, ad targeting, retail) interact in a variety of modes and channels such as browsers, apps, and devices. In each such interaction, users are identified using a token (possibly a different token for each mode/channel). Examples of such identity tokens include cookies, app IDs, etc. As the user engages more with these services, linkages are generated between tokens belonging to the same user; linkages connect multiple identity tokens together.
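Resolving those linkages into users is essentially a connected-components problem over the token graph. Here is a small, hypothetical union-find sketch of that idea (the token names are invented; at the scale the talk addresses this would run in a distributed engine rather than in-memory Python):

```python
def connected_users(linkages):
    """Union-find over identity tokens: each connected component
    of the linkage graph is treated as one (probable) user."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in linkages:
        parent[find(a)] = find(b)          # union the two components

    groups = {}
    for t in parent:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())

links = [("cookie:42", "app:ios:7"), ("app:ios:7", "email:a@x.com"),
         ("cookie:99", "device:tv:3")]
users = connected_users(links)
# Two components: one user with three tokens, one user with two.
```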
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng (Databricks)
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you'll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
How SkyElectric Uses Scylla to Power Its Smart Energy Platform (ScyllaDB)
SkyElectric provides a smart solar energy system that delivers clean, reliable energy in the developing world. Learn how SkyElectric outgrew its MySQL system and moved to Scylla to scale its IoT-driven time-series data management services.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Future of Computing is Distributed
Professor Ion Stoica, UC Berkeley RISELab
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Real-Time Image Recognition with Apache Spark with Nikita Shamgunov (Databricks)
The future of computing is visual. With everything from smartphones to Spectacles, we are about to see more digital imagery and associated processing than ever before.
In conjunction, new computing models are rapidly appearing to help data engineers harness the power of this imagery. Vast resources with cloud platforms, and the sharing of processing algorithms, are moving the industry forward quickly. The models are readily available as well.
This session will examine the image recognition techniques available with Apache Spark, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how that can assist in large-scale, high-throughput, highly-parallel image recognition. In particular, this session will showcase the use of Spark in conjunction with a high-performance database to operationalize these workflows.
Learn about a combination of:
-Architectural considerations in building an image recognition pipeline
-Advantages and pitfalls of specific approaches
-Real-time capabilities for instant matches
-Use of a fast relational datastore to persist data from Spark
You’ll also see a live demonstration on constructing and executing a real-time image recognition pipeline.
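The "algebraic operations on tensors" mentioned above often come down to scoring a query image's feature vector against an entire gallery with a single matrix product. A minimal sketch, assuming the images have already been reduced to 128-dimensional embeddings by some model (the sizes and data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings for 1000 gallery images, 128-dim, L2-normalized.
gallery = rng.standard_normal((1000, 128))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

def best_match(query):
    """One matrix-vector product scores the query against every gallery
    image at once -- the tensor algebra that makes batch matching fast."""
    q = query / np.linalg.norm(query)
    scores = gallery @ q               # cosine similarity against all images
    return int(np.argmax(scores))

# A noisy copy of gallery image 123 should still match image 123.
query = gallery[123] + 0.05 * rng.standard_normal(128)
assert best_match(query) == 123
```

Stacking many queries into a matrix turns the match step into a single matrix-matrix multiply, which is what gives the high-throughput, highly parallel behavior the session describes.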
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects (Wee Hyong Tok)
In this session, we will share cutting-edge deep learning innovations and present emerging trends in the AI community. This session is for data scientists and developers who have a keen interest in getting started on an AI project and want to learn the tools of the trade. We will draw on practical experience from working on various AI projects, and share key learnings and pitfalls.
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum... (DataStax)
Leveraging your operational data for advanced and predictive analytics enables deeper insights and greater value for cloud applications. DSE Analytics is a complete platform for Operational Analytics, including data ingestion, stream processing, batch analysis, and machine learning.
In this talk we will provide an overview of DSE Analytics as it applies to data science tools and techniques, and demonstrate these via real world use cases and examples.
Brian Hess
Rob Murphy
Rocco Varela
About the Speakers
Brian Hess Senior Product Manager, Analytics, DataStax
Brian has been in the analytics space for over 15 years ranging from government to data mining applied research to analytics in enterprise data warehousing and NoSQL engines, in roles ranging from Cryptologic Mathematician to Director of Advanced Analytics to Senior Product Manager. In all these roles he has pushed data analytics and processing to massive scales in order to solve problems that were previously unsolvable.
Spark Based Distributed Deep Learning Framework For Big Data Applications (Humoyun Ahmedov)
Deep Learning architectures, such as deep neural networks, are currently among the hottest emerging areas of data science, especially for Big Data. Deep Learning can be effectively exploited to address some major issues of Big Data, such as fast information retrieval, data classification, and semantic indexing. In this work, we designed and implemented a framework to train deep neural networks using Spark, a fast and general data-flow engine for large-scale data processing, which can utilize cluster computing to train large-scale deep networks. Training Deep Learning models requires extensive data and computation. Our proposed framework can accelerate training time by distributing model replicas, via stochastic gradient descent, among cluster nodes for data residing on HDFS.
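The data-parallel training scheme the abstract describes can be sketched with a toy model (this is a conceptual illustration with invented data, fitting y = w·x; a real framework would compute the per-shard gradients on Spark workers rather than in a sequential loop):

```python
# Toy data-parallel SGD: each "worker" holds a shard of the data and
# computes a local gradient; the driver averages the gradients and
# applies the update (Spark would evaluate the shards in parallel).
def local_gradient(w, shard):
    # d/dw of mean squared error over this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(x, 3.0 * x) for x in range(1, 21)]   # true weight w = 3
shards = [data[0:10], data[10:20]]            # two simulated workers

w, lr = 0.0, 0.002
for step in range(200):
    grads = [local_gradient(w, s) for s in shards]  # parallel in Spark
    w -= lr * sum(grads) / len(grads)

assert abs(w - 3.0) < 1e-3
```

Replacing the scalar weight with full tensors of network parameters gives the model-replica scheme in the abstract: shards of HDFS data, local gradients, and an averaged update.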
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem... (Databricks)
2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order of magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia (GoDataDriven)
Matei Zaharia is an assistant professor of computer science at Stanford University, Chief Technologist and Co-founder of Databricks. He started the Spark project at UC Berkeley and continues to serve as its vice president at Apache. Matei also co-started the Apache Mesos project and is a committer on Apache Hadoop. Matei’s research work on datacenter systems was recognized through two Best Paper awards and the 2014 ACM Doctoral Dissertation Award.
“A broad category of applications and technologies for gathering, storing, analyzing, sharing and providing access to data to help enterprise users make better business decisions” -Gartner
The primary focus of this presentation is approaching the migration of a large, legacy data store into a new schema built with Django. Includes discussion of how to structure a migration script so that it will run efficiently and scale. Learn how to recognize and evaluate trouble spots.
Also discusses some general tips and tricks for working with data and establishing a productive workflow.
Python business intelligence (PyData 2012 talk) (Stefan Urbanek)
What was the state of business intelligence tools in Python in 2012? How is Python used for data processing and analysis? Different approaches for business data and scientific data.
Video: https://vimeo.com/53063944
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017 (StampedeCon)
This talk will go over how to build an end-to-end data processing system in Python, from data ingest, to data analytics, to machine learning, to user presentation. Developments in old and new tools have made this particularly possible today. In particular, the talk will cover Airflow for process workflows, PySpark for data processing, Python data-science libraries for machine learning and advanced analytics, and building agile microservices in Python.
System architects, software engineers, data scientists, and business leaders can all benefit from attending the talk. They should learn how to build more agile data processing systems and take away some ideas on how their data systems could be simpler and more powerful.
NoSQL Deepdive - with Informix NoSQL. IOD 2013 (Keshav Murthy)
A deep dive into IBM Informix NoSQL with the MongoDB API and hybrid access. Informix now supports the MongoDB API and stores JSON natively, thus supporting flexible schemas. Informix NoSQL also supports sharding, enabling scale-out. This presentation gives an overview ranging from a real application to the technical details.
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Similar to Gartner Catalyst 2017: Image Recognition on Streaming Data (20)
Converging Database Transactions and Analytics SingleStore
Delivered at the Gartner Data and Analytics 2018 show in Texas, this presentation discusses real-time applications and their impact on existing data infrastructures.
Building a Fault Tolerant Distributed ArchitectureSingleStore
This talk will highlight some of the challenges to building a fault tolerant distributed architecture, and how MemSQL's architecture tackles these challenges.
Stream Processing with Pipelines and Stored ProceduresSingleStore
This talk will discuss an upcoming feature in MemSQL 6.5 showing how advanced stream processing use cases can be tackled with a combination of stored procedures (new in 6.0) and MemSQL's pipelines feature.
James Burkhart explains how Uber supports millions of analytical queries daily across real-time data with Apollo. James covers the architectural decisions and lessons learned building an exactly-once ingest pipeline that stores raw events across in-memory row storage and on-disk columnar storage, plus a custom metalanguage and query layer leveraging partial OLAP result-set caching and query canonicalization. Together, these pieces provide thousands of Uber employees with sub-second p95-latency analytical queries spanning hundreds of millions of recent events.
Machines and the Magic of Fast LearningSingleStore
Human-machine interaction is no longer the exclusive province of science fiction. The advance of the internet and connected devices has inspired data scientists to create machine-learning applications to extract value from these new forms of data.
So what's the next frontier?
Join MemSQL Engineer Michael Andrews and Sr. Director Mike Boyarski to learn how to use real-time data as a vehicle for operationalizing machine-learning models. Michael and Mike will explore advanced tools, including TensorFlow, Apache Spark, and Apache Kafka, and compelling use cases demonstrating the power of machine learning to effect positive change.
You will learn:
Top technologies for building the ideal machine-learning stack
How to power machine-learning applications with real-time data
A use case and demo of machine learning for social good
"Building Real-Time Data Pipelines with Kafka and MemSQL" by Rick Negrin, Director of Product Management at MemSQL for Orange County Roadshow March 17, 2017.
Tapjoy: Building a Real-Time Data Science Service for Mobile AdvertisingSingleStore
Robin Li, Director of Data Engineering and Yohan Chin, VP Data Science at Tapjoy share how to architect the best application experience for mobile users using technologies including Apache Kafka, Apache Spark, and MemSQL.
Speaker: Robin Li - Director of Data Engineering, Tapjoy and Yohan Chin - VP Data Science, Tapjoy
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
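To make the primitives concrete, here is a minimal sequential power-iteration PageRank sketch in plain Python (illustrative only; the report's OpenMP variants parallelize the multiply and sum steps inside each iteration, and the graph here is a hypothetical three-vertex cycle):

```python
def pagerank(graph, damping=0.85, iters=50):
    """graph: dict mapping vertex -> list of out-neighbours (adjacency list)."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        # Base teleport term, then add each vertex's shared-out rank.
        new = {v: (1.0 - damping) / n for v in graph}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:  # dangling vertex: spread its rank uniformly
                for u in graph:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

# A 3-cycle converges to equal ranks of 1/3 each.
ranks = pagerank({0: [1], 1: [2], 2: [0]})
```

The inner loops over `new[u] += share` are exactly the kind of multiply/accumulate primitives the notes describe running either multi-threaded (uniform) or sequentially (hybrid).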
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list-based graph representation used by graph algorithms like PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow more than 3.6x by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
2. Me at a Glance
AT MEMSQL: Senior Solutions Engineer, San Francisco
BEFORE MEMSQL: I worked on Globus, a high-performance data transfer tool for research scientists, out of the University of Chicago in coordination with Argonne National Lab.
PREVIOUS TALKS:
Real Time, Geospatial, Maps (slides)
Streaming in the Enterprise (slides)
Real Time Analytics with Spark and MemSQL (slides)
10. MemSQL at a Glance
Streaming Data Ingest: easy-to-set-up real-time data pipelines with exactly-once semantics
Live Data: memory-optimized tables for analyzing real-time events
Historical Data: disk-optimized tables with up to 10x compression and vectorized queries for fast analytics
11. Data Loading: FAST (stream data, real-time loading, multi-threaded processing)
Query Latency: LOW (vectorized queries, real-time dashboards, live data access)
Concurrency: HIGH (transactions and analytics, scalable performance, full data access)
12. MemSQL in One Slide
• Distributed ANSI SQL database
• Full ACID features
• Lock-free, shared-nothing
• Compiled queries
• Massively parallel
• Geospatial and JSON
• In-memory and on-disk
• MySQL protocol
• Streaming
• HTAP (rowstore and columnstore)
17. Architecture: It’s SQL All The Way Down
agg1> select avg(price) from orders;
leaf1> using memsql_demo_0 select count(1), sum(price) from orders;
leaf2> using memsql_demo_12 select count(1), sum(price) from orders;
...
(Diagram: aggregators Agg 1 and Agg 2 fan the query out to Leaf 1 through Leaf 4.)
22. Real-Time Image Recognition Workflow
▪ Train a model with Spark and TensorFlow
▪ Use the model to extract feature vectors from images
• Model + Image => FV
▪ Store every feature vector in a MemSQL table
CREATE TABLE features (
  id bigint(11) NOT NULL,
  image binary(4096) DEFAULT NULL,
  KEY id (id) USING CLUSTERED COLUMNSTORE
);
26. Working with Feature Vectors
For every image we store an ID and a normalized feature vector in a MemSQL table called features.
ID | Feature Vector
x  | 4KB
To find similar images using cosine similarity, we use this SQL query:
SELECT id
FROM feature_vectors
WHERE DOT_PRODUCT(image, 0xDEADBEEF) > 0.9
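The query above works because the stored vectors are normalized, so a plain dot product equals cosine similarity. A minimal Python sketch of the same check (the vectors and the 0.9 threshold mirror the SQL; no MemSQL involved):

```python
import math

def normalize(v):
    """Scale v to unit length so a dot product becomes cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    """Sum of products of corresponding entries (the DOT_PRODUCT analogue)."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 3-element feature vectors standing in for the 4KB ones.
query = normalize([0.9, 0.1, 0.4])
candidate = normalize([0.8, 0.2, 0.5])
similar = dot(query, candidate) > 0.9  # same threshold as the WHERE clause
```

A vector dotted with itself yields exactly 1.0 after normalization, which is why scores close to 1 mean near-identical images.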
28. Understanding Dot Product
▪ Dot product is an algebraic operation
• X = (x1, …, xN), Y = (y1, …, yN)
• X · Y = SUM(xi * yi)
▪ With the specific model and normalized feature vectors, DOT_PRODUCT results in a similarity score.
• The closer the score is to 1, the more similar the images.
29. Performance Numbers
▪ Memory speed: ~50 GB/sec
▪ Vector size: 4 KB
▪ 12.5 million images a second per node
▪ 1 billion images a second on a 100-node cluster
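The per-node figure follows directly from dividing memory bandwidth by vector size (using decimal GB and KB, which matches the slide's numbers). A quick sanity check of that arithmetic:

```python
# Back-of-the-envelope check of the slide's throughput claim.
memory_bandwidth_bytes = 50e9  # ~50 GB/sec of memory bandwidth per node
vector_bytes = 4e3             # 4 KB per feature vector

per_node = memory_bandwidth_bytes / vector_bytes  # 12.5 million images/sec
cluster = per_node * 100                          # 1.25 billion images/sec on 100 nodes
```

This is an upper bound set by memory bandwidth: it assumes each vector is read once per scan and the dot product itself keeps up with the memory bus, which is what the SIMD techniques on the next slide are for.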
30. Performance Enhancing Techniques
Achieving a best-in-class dot product implementation:
▪ SIMD-powered
▪ Data compression
▪ Query parallelism
▪ Scale-out
▪ Result: processing at memory bandwidth speed
32. MemSQL Streaming
Extract / Load: ingest from Apache Kafka, Amazon S3, or Kinesis, guaranteeing message delivery with exactly-once semantics
Transform: map and enrich data with user-defined or Apache Spark transformations
33. Simple Streaming Setup with CREATE PIPELINE
memsql> CREATE PIPELINE features_pipeline AS
    -> LOAD DATA KAFKA "public-kafka.memcompute.com:9092/features"
    -> INTO TABLE feature_vectors
    -> FIELDS TERMINATED BY ','
    -> (id, @image)
    -> SET image = UNHEX(SUBSTRING(@image, 3));
Query OK, (0.89 sec)
memsql> START PIPELINE features_pipeline;
Query OK, (0.01 sec)
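The SET clause strips a leading "0x" prefix (SUBSTRING from position 3) and hex-decodes the rest, which implies each Kafka message is a comma-separated "id,0x<hex>" record. A small Python sketch of producing and decoding such a message (the exact wire format is an assumption inferred from the SUBSTRING/UNHEX calls):

```python
import binascii

def encode_message(row_id, feature_bytes):
    """Format a record the way the pipeline above appears to expect: 'id,0x<hex>'."""
    return f"{row_id},0x{feature_bytes.hex()}"

def decode_image_field(field):
    """Mirror SET image = UNHEX(SUBSTRING(@image, 3)): drop '0x', then un-hex."""
    return binascii.unhexlify(field[2:])

msg = encode_message(42, b"\xde\xad\xbe\xef")
row_id, image_field = msg.split(",", 1)
restored = decode_image_field(image_field)
```

Note that SQL's SUBSTRING is 1-indexed, so starting at position 3 corresponds to Python's `field[2:]`, skipping the two prefix characters.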
Normalize each feature vector by dividing each element in the vector by the length of the vector, which gives us a length of 1
The similarity is higher if the dot product is close to one.
The dot product is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number. In Euclidean geometry, the dot product of the Cartesian coordinates of two vectors is widely used and often called the inner product (or rarely, projection product); see also inner product space.
Algebraically, the dot product is the sum of the products of the corresponding entries of the two sequences of numbers.