Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonData Con LA
Introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures (e.g., streaming, real-time intelligence, and analytics). We will review the AWS big data portfolio of services including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), Redshift, Aurora and Machine Learning, and learn how customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
These slides were presented at Cloud Expo West 2010, covering what it takes to scale your databases in the cloud -- keeping them fully reliable as well.
An introduction to self-service data with Dremio. Dremio reimagines analytics for modern data. Created by veterans of open source and big data technologies, Dremio is a fundamentally new approach that dramatically simplifies and accelerates time to insight. Dremio empowers business users to curate precisely the data they need, from any data source, then accelerate analytical processing for BI tools, machine learning, data science, and SQL clients. Dremio starts to deliver value in minutes, and learns from your data and queries, making your data engineers, analysts, and data scientists more productive.
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonData Con LA
Introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures (e.g., streaming, real-time intelligence, and analytics). We will review the AWS big data portfolio of services including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), Redshift, Aurora and Machine Learning, and learn how customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
These slides were presented at Cloud Expo West 2010, covering what it takes to scale your databases in the cloud -- keeping them fully reliable as well.
An introduction to self-service data with Dremio. Dremio reimagines analytics for modern data. Created by veterans of open source and big data technologies, Dremio is a fundamentally new approach that dramatically simplifies and accelerates time to insight. Dremio empowers business users to curate precisely the data they need, from any data source, then accelerate analytical processing for BI tools, machine learning, data science, and SQL clients. Dremio starts to deliver value in minutes, and learns from your data and queries, making your data engineers, analysts, and data scientists more productive.
Extending Machine Learning Algorithms with PySparkDatabricks
Machine learning practitioners are most comfortable using high-level programming languages such as Python. This is a barrier to parallelizing algorithms with big data frameworks such as Apache Spark, which are written in lower-level languages. Databricks partnered with the Regeneron Genetics Center to create the Glow library for population-scale genomics data storage and analytics. Glow V1.0.0 includes PySpark-based implementations for both existing and novel machine learning algorithms. We will discuss how leveraging tooling for Python users, especially Pandas UDFs, accelerated our development velocity and impacted our algorithms’ computational performance.
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Jeff Hung
It is a common believe that Hadoop should run on physical servers. However, this requires huge capital investment in the beginning while you have no guarantee for the returns. Therefore, things usually end up in proving big-data with not-that-big data. One approach to workaround this dilemma is to run Cloud Computing in the Cloud. With the elastic that AWS provides, you could spend little but run big!! However, is it really a good idea? In this sharing, we will try to answer it – based on the result from an 1-year journey with real application and real big-data.
Seamless, Real-Time Data Integration with ConnectPrecisely
As many of our customers have come to learn - integrating legacy data into modern data architecture is easier said than done! View this on-demand webinar to learn all about Precisely's seamless data integration solutions and how they have helped thousands of customers like you trust their data.
Learn about the two flavors of Precisely's Connect:
• Collect, prepare, transform and load your data to various targets using Connect ETL with the flexibility of using clusters and running on many different environments. With our 'design once, deploy anywhere' feature; what is built on prem today, can run on a cloud platform tomorrow with no development or mainframe expertise required.
• Capture data changes in real-time with no coding, tuning or performance impact using Connect CDC. Replicating exactly WHAT you need and HOW you need it with over 80 built-in data transformation methods.
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/rKoBJcnsFpM
Speaker's Bio:
Mateusz is a software developer who loves all things distributed, machine learning and hates buzzwords. His favourite hobby data juggling. He obtained his M.Sc. in Computer Science from AGH UST in Krakow, Poland, during which he did an exchange at L’ECE Paris in France and worked on distributed flight booking systems. After graduation he move to Tokyo to work as a researcher at Fujitsu Laboratories on machine learning and NLP projects, where he is still currently based. In his spare time he tries to be part of the IT community by organizing, attending and speaking at conferences and meet ups.
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...MLconf
Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be a difficult both in terms of the algorithmic and engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support both production experimentation, and how to test machine learning systems.
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned . a quick overview of a big data techonoligies, which were selected and disregard in our company
The video: https://youtu.be/l5KmaZNQxaU
dont forget to subcribe to the youtube channel
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
Manage your workloads, not the infrastructure. This session covers how to use Azure Batch to build cloud-native Big Compute and HPC solutions. Batch provides APIs for creating and managing pools of resources and running jobs on them. It handles VM provisioning, management, and job scheduling under the hood, so your developers don’t have to.
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEIBig Data Week
Alex Bordei is a developer turned Product Manager. He has been developing infrastructure products for over nine years. Before becoming Bigstep’s Product Manager, he was one of the core developers for Hostway Corporation’s provisioning platform. He then focused on defining and developing products for Hostway’s EMEA market and was one of the pioneers of virtualization in the company. After successfully launching two public clouds based on VMware software, he created the first prototype of Bigstep’s Full Metal Cloud in 2011. He now focuses on guaranteeing that the Full Metal Cloud is the highest performance cloud in the world, for big data applications.
Deploying Real-Time Decision Services Using Redis with Tague GriffithDatabricks
Most of the energy and attention in machine learning focused on the model training side of the problem. Multiple frameworks, in every language, provide developers with access to a host of data manipulation and training algorithms, but until recently developers had virtually no frameworks to build out predictive engines from trained ML models. Most developers resorted to building custom applications, but building highly available, highly performant applications is difficult.
Redis in conjunction with the Redis-ML module provides a server framework for developers to build predictive engines with familiar, off-the-shelf components. Developers can take advantage of all the features of Redis to deliver faster and more reliable prediction engines with less custom development.
This talk is a technical session which examines how Redis can be used in conjunction with a Spark based training platform to deliver real-time predictive and decision making features as part of a larger system. To set the context for the session, we start with an introduction to the Redis data model and how features of Redis (namespace, replication) can be used to build fast predictive engines (at scale), that are more reliable, more feature rich and easier to manage than custom applications. From there, we look at the model serving capabilities of Redis-ML and how they can be integrated with a Spark-based ML pipeline to automate the entire model development process from training to deployment.
The session ends with a demonstration of a simple machine learning pipeline. Using Spark we train several example models, load them directly into Redis and demonstrate Redis as the predictive engine for making real-time recommendations. At the end of the session, developers should feel confident that they could use Redis as a server framework to build a predictive serving engine for a Spark-based ML pipeline.
Extending Machine Learning Algorithms with PySparkDatabricks
Machine learning practitioners are most comfortable using high-level programming languages such as Python. This is a barrier to parallelizing algorithms with big data frameworks such as Apache Spark, which are written in lower-level languages. Databricks partnered with the Regeneron Genetics Center to create the Glow library for population-scale genomics data storage and analytics. Glow V1.0.0 includes PySpark-based implementations for both existing and novel machine learning algorithms. We will discuss how leveraging tooling for Python users, especially Pandas UDFs, accelerated our development velocity and impacted our algorithms’ computational performance.
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Jeff Hung
It is a common believe that Hadoop should run on physical servers. However, this requires huge capital investment in the beginning while you have no guarantee for the returns. Therefore, things usually end up in proving big-data with not-that-big data. One approach to workaround this dilemma is to run Cloud Computing in the Cloud. With the elastic that AWS provides, you could spend little but run big!! However, is it really a good idea? In this sharing, we will try to answer it – based on the result from an 1-year journey with real application and real big-data.
Seamless, Real-Time Data Integration with ConnectPrecisely
As many of our customers have come to learn - integrating legacy data into modern data architecture is easier said than done! View this on-demand webinar to learn all about Precisely's seamless data integration solutions and how they have helped thousands of customers like you trust their data.
Learn about the two flavors of Precisely's Connect:
• Collect, prepare, transform and load your data to various targets using Connect ETL with the flexibility of using clusters and running on many different environments. With our 'design once, deploy anywhere' feature; what is built on prem today, can run on a cloud platform tomorrow with no development or mainframe expertise required.
• Capture data changes in real-time with no coding, tuning or performance impact using Connect CDC. Replicating exactly WHAT you need and HOW you need it with over 80 built-in data transformation methods.
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/rKoBJcnsFpM
Speaker's Bio:
Mateusz is a software developer who loves all things distributed, machine learning and hates buzzwords. His favourite hobby data juggling. He obtained his M.Sc. in Computer Science from AGH UST in Krakow, Poland, during which he did an exchange at L’ECE Paris in France and worked on distributed flight booking systems. After graduation he move to Tokyo to work as a researcher at Fujitsu Laboratories on machine learning and NLP projects, where he is still currently based. In his spare time he tries to be part of the IT community by organizing, attending and speaking at conferences and meet ups.
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...MLconf
Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be a difficult both in terms of the algorithmic and engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support both production experimentation, and how to test machine learning systems.
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned . a quick overview of a big data techonoligies, which were selected and disregard in our company
The video: https://youtu.be/l5KmaZNQxaU
dont forget to subcribe to the youtube channel
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
Manage your workloads, not the infrastructure. This session covers how to use Azure Batch to build cloud-native Big Compute and HPC solutions. Batch provides APIs for creating and managing pools of resources and running jobs on them. It handles VM provisioning, management, and job scheduling under the hood, so your developers don’t have to.
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEIBig Data Week
Alex Bordei is a developer turned Product Manager. He has been developing infrastructure products for over nine years. Before becoming Bigstep’s Product Manager, he was one of the core developers for Hostway Corporation’s provisioning platform. He then focused on defining and developing products for Hostway’s EMEA market and was one of the pioneers of virtualization in the company. After successfully launching two public clouds based on VMware software, he created the first prototype of Bigstep’s Full Metal Cloud in 2011. He now focuses on guaranteeing that the Full Metal Cloud is the highest performance cloud in the world, for big data applications.
Deploying Real-Time Decision Services Using Redis with Tague GriffithDatabricks
Most of the energy and attention in machine learning focused on the model training side of the problem. Multiple frameworks, in every language, provide developers with access to a host of data manipulation and training algorithms, but until recently developers had virtually no frameworks to build out predictive engines from trained ML models. Most developers resorted to building custom applications, but building highly available, highly performant applications is difficult.
Redis in conjunction with the Redis-ML module provides a server framework for developers to build predictive engines with familiar, off-the-shelf components. Developers can take advantage of all the features of Redis to deliver faster and more reliable prediction engines with less custom development.
This talk is a technical session which examines how Redis can be used in conjunction with a Spark based training platform to deliver real-time predictive and decision making features as part of a larger system. To set the context for the session, we start with an introduction to the Redis data model and how features of Redis (namespace, replication) can be used to build fast predictive engines (at scale), that are more reliable, more feature rich and easier to manage than custom applications. From there, we look at the model serving capabilities of Redis-ML and how they can be integrated with a Spark-based ML pipeline to automate the entire model development process from training to deployment.
The session ends with a demonstration of a simple machine learning pipeline. Using Spark we train several example models, load them directly into Redis and demonstrate Redis as the predictive engine for making real-time recommendations. At the end of the session, developers should feel confident that they could use Redis as a server framework to build a predictive serving engine for a Spark-based ML pipeline.
Building a Large Scale Recommendation Engine with Spark and Redis-ML with Sha...Databricks
Redis-ML is a Redis module for high performance, real-time serving of Spark-ML models. It allows users to train large complex models in Spark, and then store and query the models directly on Redis clusters. The high throughput and low latency of Redis-ML allows users to perform heavy classification operations in real time while using a minimal number of servers. This unique architecture enables significant savings in resources compared to current commonly used methods, without loss in precision or server performance.
This session will demonstrate how to build a production-level recommendation system from the ground up using Spark-ML and Redis-ML. It will also describe performance and accuracy benchmarks, comparing the results with current standard methods.
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
Getting Ready to use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad serving use-case, we look at how Redis can improve the performance and reduce the cost of using complex ML-models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these feature. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex, ML-models with high performance.
Mining software vulns in SCCM / NIST's NVDLoren Gordon
Patch management for 3rd-party software can be a significant challenge. The raw data for effective vulnerability management is available in MS’ SCCM (software inventory) and NIST’s NVD (vulnerability database). However extracting the relevant information from complex, sometimes undocumented data structures poses significant challenges.
The stage is set with a brief overview of SCCM / NVD data structures as well as a look at a (non-typical but interesting!) production environment. Then we’ll take a quick dive into data wrangling / Machine Learning fundamentals applied to this problem: feature extraction, choice of approach, algorithm choice and turning.
Once the technical challenges are resolved, the path to “Data Nirvana” can still be strewn with significant non-technical hurdles to overcome as well. We will discuss some practical “been there, done that” examples.
Guiding through a typical Machine Learning PipelineMichael Gerke
Many People are talking about AI and Machine Learning. Here's a quick guideline how to manage ML Projects and what to consider in order to implement machine learning use cases.
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...Amazon Web Services
Customers are adopting Apache Spark ‒ an open-source distributed processing framework ‒ on Amazon EMR for large-scale machine learning workloads, especially for applications that power customer segmentation and content recommendation. By leveraging Spark ML, a set of machine learning algorithms included with Spark, customers can quickly build and execute massively parallel machine learning jobs. Additionally, Spark applications can train models in streaming or batch contexts, and can access data from Amazon S3, Amazon Kinesis, Amazon Redshift, and other services. This session explains how to quickly and easily create scalable Spark clusters with Amazon EMR, build and share models using Apache Zeppelin and Jupyter notebooks, and use the Spark ML pipelines API to manage your training workflow. In addition, Jasjeet Thind, Senior Director of Data Science and Engineering at Zillow Group, will discuss his organization's development of personalization algorithms and platforms at scale using Spark on Amazon EMR.
Lessons Learned from Building Machine Learning Software at NetflixJustin Basilico
Talk from Software Engineering for Machine Learning Workshop (SW4ML) at the Neural Information Processing Systems (NIPS) 2014 conference in Montreal, Canada on 2014-12-13.
Abstract:
Building a real system that incorporates machine learning as a part can be a difficult effort, both in terms of the algorithmic and engineering challenges involved. In this talk I will focus on the engineering side and discuss some of the practical issues we’ve encountered in developing real machine learning systems at Netflix and some of the lessons we’ve learned over time. I will describe our approach for building machine learning systems and how it comes from a desire to balance many different, and sometimes conflicting, requirements such as handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. I will focus on what it takes to put machine learning into a real system that works in a feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. I will address the particular software engineering challenges that we’ve faced in running our algorithms at scale in the cloud. I will also mention some simple design patterns that we’ve fond to be useful across a wide variety of machine-learned systems.
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsGanesan Narayanasamy
This presentation gave deep dive into various machine learning and deep learning algorithms followed by an overview of the hardware and software technologies for democratization of AI including OpenPOWER/POWER9 solutions.
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...Databricks
Cybercrime is one the greatest threats to every company in the world today and a major problem for mankind in general. The damage due to Cybercrime is estimated to be around $6 Trillion By 2021. Security professionals are struggling to cope with the threat. As a result, powerful and easy to use tools are necessary to aid in this battle. For this purpose we created an anomaly detection framework focused on security which can identify anomalous access patterns. It is built on top of Apache Spark and can be applied in parallel over multiple tenants. This allows the model to be trained over the data of thousands of customers over a Databricks cluster within less than an hour. The model leverages proven technologies from Recommendation Engines to produce high quality anomalies. We thoroughly evaluated the model’s ability to identify actual anomalies by using synthetically generated data and also by creating an actual attack and showing that the model clearly identifies the attack as anomalous behavior. We plan to open source this library as part of a cyber-ML toolkit we will be offering.
Learn about recent advances in MongoDB in the area of In-Memory Computing (Apache Spark Integration, In-memory Storage Engine), and how these advances can enable you to build a new breed of applications, and enhance your Enterprise Data Architecture.
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce and Amazon Redshift, fully managed petabyte-scale data warehouse solutions.
- Learn about real time data processing with Amazon Kinesis.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.
New to MongoDB? We'll provide an overview of installation, high availability through replication, scale out through sharding, and options for monitoring and backup. No prior knowledge of MongoDB is assumed. This session will jumpstart your knowledge of MongoDB operations, providing you with context for the rest of the day's content.
In the world of recommendation systems, there are various theories and algorithms that work together to give the best results. Among these, the core recommendation algorithm is crucial. This paper will provide an introduction to some fundamental algorithms used in recommendation systems. These algorithms are like building blocks that help make recommendations more effective.
Similar to Big Data LDN 2017: Serving Predictive Models with Redis (20)
Blueprint Series: Banking In The Cloud – Ultra-high Reliability ArchitecturesMatt Stubbs
Data architecture for a challenger bank.Speaker: Jason Maude, Head of Technology Advocacy, Starling BankSpeaker Bio: Jason Maude is a coder, coach, and public speaker. He has over a decade of experience working in the financial sector, primarily in creating and delivering software. He is passionate about explaining complex technical concepts to those who are convinced that they won't be able to understand them. He currently works at Starling Bank as their Head of Technology Advocacy and host of the Starling podcast.Filmed at Skills Matter/Code Node London on 9th May 2019 as part of the Big Data LDN Meetup Blueprint Series.Meetup sponsored by DataStax.
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...Matt Stubbs
Speaker: Cedrick Lunven, Developer Advocate, DataStax
Speaker Bio: Cedrick is a Developer Advocate at DataStax where he finds opportunities to share his passions by speaking about developing distributed architectures and implementing reference applications for developers. In 2013, he created FF4j, an open source framework for Feature Toggle which he still actively maintains. He is now contributor in JHipster team.
Talk Synopsis: We have all introduced more or less functional programming and asynchronous operations into our applications in order to speed up and distribute treatments (e.g., multi-threading, future, completableFuture, etc.). To build truly non-blocking components, optimize resource usage, and avoid "callback hell" you have to think reactive—everything is an event.
From the frontend UI to database communications, it’s now possible to develop Java applications as fully reactive with frameworks like Spring WebFlux and Reactor. With high throughput and tunable consistency, applications built on top of Apache Cassandra™ fit perfectly within this pattern.
DataStax has been developing Apache Cassandra drivers for years, and in the latest version of the enterprise driver we introduced reactive programming.
During this session we will migrate, step by step, a vanilla CRUD Java service (SpringBoot / SpringMVC) into reactive with both code review and live coding. Bring home a working project!
Filmed at Skills Matter/Code Node London on 9th May 2019 as part of the Big Data LDN Meetup Blueprint Series.
Meetup sponsored by DataStax.
Blueprint Series: Expedia Partner Solutions, Data PlatformMatt Stubbs
Join Anselmo for an engaging overview of the new end-to-end data architecture at Expedia Group, taking a journey through cloud and on-prem data lakes, real-time and batch processes and streamlined access for data producers and consumers. Find out how the new architecture unifies a complex mix of data sources and feeds the data science development cycle. Expedia might appear to be a market-leading travel company – in reality, it’s a highly successful technology and data science company.
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...Matt Stubbs
Richard Freeman talks about how the data science team at JustGiving built KOALA, a fully serverless stack for real-time web analytics capture, stream processing, metrics API, and storage service, supporting live data at scale from over 26M users. He discusses recent advances in serverless computing, and how you can implement traditionally container-based microservice patterns using serverless-based architectures instead. Deploying Serverless in your organisation can dramatically increase the delivery speed, productivity and flexibility of the development team, while reducing the overall running, DevOps and maintenance costs.
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCEMatt Stubbs
Date: 14th November 2018
Location: Customer Experience Theatre
Time: 12:30 - 13:00
Speaker: David Maitland
Organisation: Redis Labs
About: This session will cover the technology underpinning at the software infrastructure level required to deliver the instant experience to the end user and enterprises alike. Use cases and value derived by major brands will be shared in this insightful session based the world's most loved database REDIS.
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQLMatt Stubbs
Date: 14th November 2018
Location: Customer Experience Theatre
Time: 11:50 - 12:20
Speaker: Perry Krug
Organisation: Couchbase
About: Who wants to see an ad today for the shoes they bought last week? Everyone knows that customer experience is driven by data: don't waste an opportunity to get them the right data at the right time. Real-time results are critical, but raw speed isn't everything: you need power and flexibility to react to changes on the fly. Come learn how market-leading enterprises are using Couchbase as their speed layer for ingestion, incremental view and presentation layers alongside Kafka, Spark and Hadoop to liberate their data lakes.
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTSMatt Stubbs
Date: 13th November 2018
Location: Customer Experience Theatre
Time: 11:50 - 12:20
Speaker: Charlotte Emms
Organisation: seenit
About: How do you get your colleagues interested in the power of data? Taking you through Seenit’s journey using Couchbase's NoSQL database to create a regular, fully automated update in an easily digestible format.
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...Matt Stubbs
Date: 14th November 2018
Location: Governance and MDM Theatre
Time: 10:30 - 11:00
Speaker: Mike Ferguson
Organisation: IBS
About: For most organisations today, data complexity has increased rapidly. In the area of operations, we now have cloud and on-premises OLTP systems with customers, partners and suppliers accessing these applications via APIs and mobile apps. In the area of analytics, we now have data warehouse, data marts, big data Hadoop systems, NoSQL databases, streaming data platforms, cloud storage, cloud data warehouses, and IoT-generated data being created at the edge. Also, the number of data sources is exploding as companies ingest more and more external data such as weather and open government data. Silos have also appeared everywhere as business users are buying in self-service data preparation tools without consideration for how these tools integrate with what IT is using to integrate data. Yet new regulations are demanding that we do a better job of governing data, and business executives are demanding more agility to remain competitive in a digital economy. So how can companies remain agile, reduce cost and reduce the time-to-value when data complexity is on the up?
In this session, Mike will discuss how companies can create an information supply chain to manufacture business-ready data and analytics to reduce time to value and improve agility while also getting data under control.
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 12:30 - 13:00
Organisation: Immuta
About: Artificial intelligence is rising in importance, but it’s also increasingly at loggerheads with data protection regimes like the GDPR—or so it seems. In this talk, Sophie will explain where and how AI and GDPR conflict with one another, and how to resolve these tensions.
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...Matt Stubbs
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 11:50 - 12:20
Speaker: Mark Pritchard
Organisation: Denodo
About: Self-service analytics promises to liberate business users to perform analytics without the assistance of IT, and this in turn promises to free IT to focus on enhancing the infrastructure.
Join us to learn how data virtualization will allow you to gain real-time access to enterprise-wide data and deliver self-service analytics. We will explore how you can seamlessly unify fragmented data, replace your high-maintenance and high cost data integrations with a single, low-maintenance data virtualization layer; and how you can preserve your data integrity and ensure data lineage is fully traceable.
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...Matt Stubbs
Date: 13th November 2018
Location: Governance and MDM Theatre
Time: 11:10 - 11:40
Organisation: TIBCO
About: The big data phenomenon continues to accelerate, resulting in multiple data lakes at most organisations. However, according to Gartner, “Through 2019, 90% of the information assets from big data analytic efforts will be siloed and unusable across multiple business processes.”
Are you ready to unleash this data from these silos and deliver the insights your organisation needs to drive compelling customer experiences, innovative new products and optimized operations? In this session you will learn how to apply data virtualisation to: - Access, transform and deliver data from across your lakes, clouds and other data sources - Empower a range of analytic users and tools with all the data they need - Move rapidly to a modern and flexible data architecture for the long run In addition, you will see a demonstration of data virtualisation in action.
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...Matt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 12:30 - 13:00
Organisation: Cloudera
About: The growth of public cloud is reinforcing the need to think more carefully about taking a consistent approach to data governance as technology teams build out a flexible and agile infrastructure to meet the demands of the business.
Join this session to learn more about Cloudera's recommended approach for enterprise-grade security and governance and how to ensure a consistent framework across private, public and on-premises environments.
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICSMatt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 11:10 - 11:40
Organisation: Microlise
About: Microlise are a leading provider of technology solutions to the transport and logistics industry worldwide. Discover how, with over 400,000 connected assets generating billions of messages a day, Microlise is evolving its platform to bring real-time analytics to its customers to improve safety, security and efficiency outcomes.
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSEMatt Stubbs
Date: 14th November 2018
Location: Data-Driven Ldn Theatre
Time: 10:30 - 11:00
Speaker: Anna Matty
Organisation: Experian
About: Today there is a widespread focus on the 'how' in relation to problem solving. How can we gain better knowledge of what consumers want, or need? How can we be more efficient, reduce the cost to serve, or grow the lifetime value of a customer? But, how do you move to a place where you are not only solving a problem, you are redesigning the entire strategic potential of that problem? You are being armed with insight on what the problem is.
Data and innovation offer huge potential to revolutionise all markets. There is an opportunity to be one step ahead of the need, to redesign journeys and enhance enterprise strategies. To do this you need access to the most advanced analytics but also the best quality, including variations and types of data, and then the technology that can act on this insight. Data science can present a unique opportunity for uncovered growth and accelerate your business through strategic innovation – fast. In this session you will hear more about how today's analytics can move from a single task, to an ongoing strategic opportunity. An opportunity that helps you move at the speed of the market and helps you maximise every opportunity.
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNINGMatt Stubbs
Date: 13th November 2018
Location: Data-Driven Ldn Theatre
Time: 13:10 - 13:40
Speaker: Brian Goral
Organisation: Cloudera
About: The field of machine learning (ML) ranges from the very practical and pragmatic to the highly theoretical and abstract. This talk describes several of the challenges facing organisations that want to leverage more of their data through ML, including some examples of the applied algorithms that are already delivering value in business contexts.
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...Matt Stubbs
Date: 13th November 2018
Location: Data-Driven Ldn Theatre
Time: 12:30 - 13:00
Speaker: Paul Wilkinson, Naveen Gupta
Organisation: Cloudera
About: Investment banks are faced with some of the toughest regulatory requirements in the world. In a market where data is increasing and changing at extraordinary rates the journey with data governance never ends.
In this session, Deutsche Bank will share their journey with big data and explain some of the processes and techniques they have employed to prepare the bank for today’s challenges and tomorrow’s opportunities.
Brought to you by Naveen Gupta, VP Software Engineering, Deutsche Bank and Paul Wilkinson, Principal Solutions Architect, Cloudera.
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...Matt Stubbs
Date: 14th November 2018
Location: Self-Service Analytics Theatre
Time: 13:50 - 14:20
Speaker: Stephanie McReynolds
Organisation: Alation
About: Raw data is proliferating at an enormous rate. But so are our derived data assets - hundreds of dashboards, thousands of reports, millions of transformed data sets. Self-service analytics have ensured that this noise is making it increasingly hard to understand and trust data for decision-making. This trust gap is holding your organisation back from business outcomes.
European analytics leaders have found a way to close the gap between data and decision-making. From MunichRe to Pfizer and Daimler, analytics teams are adopting data catalogues for thousands of self-service analytics users.
Join us in this session to hear how data catalogues that activate data by incorporating machine learning can:
• Increase analyst productivity 20-40%
• Boost the understanding of the nuances of data and
• Establish trust in data-driven decisions with agile stewardship
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATEMatt Stubbs
Date: 13th November 2018
Location: Self-Service Analytics Theatre
Time: 15:50 - 16:20
Speaker: Nishanth Kadiyala
Organisation: Progress
About: The exploding API economy, combined with an advanced analytics market projected to reach $30 billion by 2019, is forcing IT to expose more and more data through APIs. Business analysts, data engineers, and data scientists are still not happy because their needs never really made it into the existing API strategies. This is because most APIs are designed for application integration, but not for the data workers who are looking for APIs that facilitate direct data access to run complex analytics. Data APIs are specifically designed to provide that frictionless data access experience to support analytics across standard interoperable interfaces such as OData (REST) or ODBC/JDBC (SQL). Consider expanding your API strategy to service the developers with open analytics in this $30 billion market.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
8. 8
Typical Spark Application Structure
8
Spark Training
Data is loaded into Spark Model is saved in files
File System Custom Server
Model is loaded to your
custom app
Serving Client
Client App
12. 12
A Quick Recap of Redis
Key
"I'm a Plain Text String!"
{ A: “foo”, B: “bar”, C: “baz” }
Strings / Bitmaps / BitFields
Hash Tables (objects!)
Linked Lists
Sets
Sorted Sets
Geo Sets
HyperLogLog
{ A , B , C , D , E }
[ A → B → C → D → E ]
{ A: 0.1, B: 0.3, C: 100, D: 1337
}
{ A: (51.5, 0.12), B: (32.1, 34.7)
}
00110101 11001110 10101010
13. 13
Redis Modules
• Any C/C++ program can now run on Redis
• Use existing or add new data-structures
• Enjoy simplicity, infinite scalability and high availability while
keeping the native speed of Redis
• Can be created by anyone
New Capabilities
New Commands
New Data Types
14. 14
Redis-ML: Predictive Model Serving Engine
• Predictive models as native Redis types
• Perform evaluation directly in Redis
• Store training output as “hot model”
Spark Training
Data loaded into Spark Model is saved in
Redis-ML
Redis-ML
Serving Client
Client
App
Client
App
Client
App
Any Training
Platform
15. 15
Redis ML Module
Redis Module
Tree Ensembles
Linear Regression
Logistic Regression
Matrix + Vector Operations
More to come...
16. 16
Random Forest Model
• A collection of decision trees
• Supports classification & regression
• Splitter Node can be:
◦ Categorical (e.g. day == “Sunday”)
◦ Numerical (e.g. age < 43)
• Decision is taken by the majority of decision trees
17. 17
Classic Tree Problem: Titanic Survival
YES
Sex =
Male ?
Age <
9.5?
Sibps >
2.5?
Survived
Died
SurvivedDied
NO
• Passenger Data encoded as feature vectors
• ML Algorithm learns the tree rules
• ID3, CART (RPART), etc.
• Tree rules used to infer results
18. 18
Titanic Survival: Random Forest
YES
Sex =
Male ?
Age <
9.5?
*Sibps >
2.5?
Survived
Died
SurvivedDied
NO YES
Country=
US?
State =
CA?
Height>
1.60m?
Survived
Died
SurvivedDied
NO YES
Weight<
80kg?
I.Q<100?
Eye color
=blue?
Survived
Died
SurvivedDied
NO
Tree #1 Tree #2 Tree #3
19. 19
Who Would Survive the Titanic
John:
• Male, 34,
• Married w/ 2 kids (Sibps=3)
• New York, USA
• 1.78m, 78kg
• 110 iq
• Blue eyes
Mathew:
• Male, 6
• 3 Sisters (Sibps=3)
• New York, USA
• 1.06m, 22.7 kg
• 100 iq
• Brown eyes
Let's use our forest to find out
20. 20
Redis: Forest Data Type
Add nodes to a tree in a forest:
Perform classification/regression of a feature vector:
ML.FOREST.ADD <forestId> <treeId> <path>
[ [NUMERIC|CATEGORIC] <splitterAttr> <splitterVal> ] |
[LEAF] <predVal>
ML.FOREST.RUN <forestId> <features>
[CLASSIFICATION|REGRESSION]
21. 21
Real World Challenge
• Ad serving company
• Need to serve 20,000 ads/sec @ 50msec data-center latency
• Runs 1k campaigns → 1K random forest
• Each forest has 15K trees
• On average each tree has 7 levels (depth)
22. 22
Ad Serving costs: Homegrown v. Redis
Homegrown
1,247 x c4.8xlarge 35 x c4.8xlarge
Cut computing infrastructure
by 97%
22
23. 23
Redis ML with Spark ML
Random Forest; 1,000 forests @ 15,000 trees
Classification Time Over Spark
13x Faster
27. 27
Step 1: Get The Data
Download and extract the MovieLens 100K Dataset
The data is organized in separate files:
• Ratings: user id | item id | rating (1-5) | timestamp
• Item (movie) info: movie id | genre info fields (1/0)
• User info: user id | age | gender | occupation
Our classifier should return the expected rating (from 1 to 5) a user would give the movie in question
28. 28
Step 2: Transform
28
The training data for each movie should contain 1 line per user:
• class (rating from 1 to 5 the user gave to this movie)
• user info (age, gender, occupation)
• user ratings of other movies (movie_id:rating ...)
• user genre rating averages (genre:avg_score ...)
Run gen_data.py to transform the files to the desired format
29. 29
Step3: Train and Load to Redis
// Create a new forest instance
val rf = new
RandomForestClassifier().setFeatureSubsetStrategy("auto").setLabelCol("indexedLabel").setFeat
uresCol("indexedFeatures").setNumTrees(500)
…..
// Train model
val model = pipeline.fit(trainingData)
…..
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
// Load the model to redis
val f = new Forest(rfModel.trees)
f.loadToRedis(”movie-10", "127.0.0.1")
30. 30
Step 4: Execute inference in Redis
Redis-ML
+
Spark
Training
Client App
31. 31
Summary
• Train with Spark, Serve with Redis
• 97% resource cost serving
• Simplify ML lifecycle
• Redise (Cloud or Pack):
‒Scaling, HA, Performance
‒PAYG – cost optimized
‒Ease of use
‒Supported by the teams who created Spark and
Redis
Spark Training
Data loaded into Spark Model is saved in
Redis-ML
Redis-ML
Serving Client
Client
App
Client
App
Client
App
+