In this talk, we will present how we analyze, predict, and visualize network quality data, as a Spark AI use case at a telecommunications company. SK Telecom is the largest wireless telecommunications provider in South Korea, with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, amounting to 60 TB and 120 billion records per day.
To address the problems we previously had with Spark on HDFS, we developed a new data store for SparkSQL, built on Redis and RocksDB, that allows us to distribute and store these data in real time and analyze them right away. Not satisfied with analyzing network quality in real time, we also set out to predict network quality in the near future so that network device failures can be detected and recovered quickly, by designing a network signal pattern-aware DNN model and a new in-memory data pipeline from Spark to TensorFlow.
In addition, by integrating Apache Livy and MapboxGL with SparkSQL and our new store, we have built a geospatial visualization system that shows the current population and signal strength of all 300,000 cells on a map in real time.
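The Redis/RocksDB store described above is SK Telecom's in-house system, so there is no public API to show; as a rough illustration only, the sketch below uses the open-source spark-redis connector to expose Redis-resident cell records to SparkSQL and run a simple aggregation. The keyspace, key column, and metric column ("cell_metrics", "cell_id", "rsrp") are assumptions, not the actual schema.

    # Rough PySpark sketch using the open-source spark-redis connector as a
    # stand-in for the custom Redis/RocksDB store; names are illustrative.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("cell-quality-analysis")
             .config("spark.redis.host", "redis-host")   # hypothetical endpoint
             .config("spark.redis.port", "6379")
             .getOrCreate())

    cells = (spark.read
             .format("org.apache.spark.sql.redis")        # spark-redis data source
             .option("table", "cell_metrics")
             .option("key.column", "cell_id")
             .load())

    # Average signal strength and sample count per cell, queryable via SparkSQL.
    (cells.groupBy("cell_id")
          .agg(F.avg("rsrp").alias("avg_rsrp"), F.count("*").alias("samples"))
          .createOrReplaceTempView("cell_quality"))

    spark.sql("SELECT * FROM cell_quality ORDER BY avg_rsrp LIMIT 20").show()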
Your Challenge
It is difficult to start the project, engage the right people, and find the necessary requirements to drive the value of an enterprise architecture operating model.
It is challenging to navigate the common enterprise architecture (EA) frameworks and right-size them for your organization.
The EA practice may struggle to effectively collaborate with the business when making decisions, resulting in outcomes that fail to engage stakeholders.
Our Advice
Critical Insight
The benefits of an EA program are only realized when all components of the operating model enable the achievement of the program's goals and objectives. Organizations often overplay the governance card while ignoring the motivational aspects that can be addressed through the organization's structure or stakeholder relations.
Info-Tech’s methodology ensures that all components of an EA operating model are considered to optimize the performance of the EA program.
Impact and Result
Place and structure your EA team to address the needs of stakeholders and deliver on the previously created strategy.
Create an engagement model by understanding each relevant process of COBIT 5 and make stakeholder interaction cards to initiate conversations.
Recognize the need for governance and formulate the appropriate boards while considering various policies, principles, and compliance.
Develop a unique architecture development framework based on best-practice approaches with an understanding of the various architectural views to ensure the creation of a successful process.
Build a communication plan and roadmap to efficiently navigate through enterprise change and involve the necessary stakeholders.
Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterprise - Data Con LA
Data Con LA 2020
Description
It’s no secret that the roots of Data Science date back to the 1960s and were first mainstreamed in the 1990s with the emergence of Data Mining. This occurred when commercially affordable computers started offering the horsepower and storage necessary to perform advanced statistics to scale.
However, the words “to scale” have evolved over time. The leap to “Big Data” is only one serial aspect of growth. Beyond the typical 1-off studies that catalyzed the field of Data Mining, Data Science now fulfills enterprise and multi-enterprise use cases spanning much broader and deeper data sets and integrations. For example, AI and Machine Learning frameworks can interoperate with a variety of other systems to drive alerting, feedback loops, predictive frameworks, prescriptive engines, continual learning, and more. The deployment of AI/ML processes themselves often involves integration with contemporary DevOps tools.
Now segue to SEAL – the Scalable Enterprise Analytic Lifecycle. In this presentation, you’ll learn how to cover the major bases of a modern Data Science project – and Citizen Data Science as well – from conception, learning, and evaluation through integration, implementation, monitoring, and continual improvement. And as the name implies, your deployments will be performant and scale as expected in today’s environments.
Speaker
Jeff Bertman, CTO, Dfuse Technologies
Cloud Migration, Application Modernization and Security for Partners - Amazon Web Services
As AWS continues to expand, enterprise customers are increasingly looking to our partner ecosystem to assist in migrating their workloads to the cloud. This session describes the challenges, lessons learned and best practices for large scale application migrations. We will use real examples from our consulting partners and AWS Professional Services to illustrate how to move workloads to the cloud while modernizing the associated applications to take advantage of AWS’ unique benefits. We will also dive into how to use an array of AWS services and features to improve a customer’s security posture as they are migrating and once they are up and running in the cloud.
Cloud Migration, Application Modernization, and Security - Tom Laszewski
As AWS continues to expand, enterprise customers are looking to our partner ecosystem to assist in migrating their workloads to the cloud. This session describes the challenges, lessons learned and best practices for large scale application migrations. We will use real examples from our consulting partners and AWS Professional Services to illustrate how to move workloads to the cloud while modernizing the associated applications to take advantage of AWS’ unique benefits. We will also dive into how to use an array of AWS services and features to improve a customer’s security posture as they are migrating and once they are up and running in the cloud.
Retail Analytics and BI with Looker, BigQuery, GCP & Leigha Jarett - Daniel Zivkovic
Leigha Jarett of GCP explains how to bring Cloud "superpowers" to your data and modernize your Business Intelligence with Looker, BigQuery, and Google Cloud services, using the example of Cymbal Direct, one of Google Cloud's demo brands. The meetup recording, with a table of contents for easy navigation, is at https://youtu.be/BpzJU_S40ic.
P.S. For more interactive lectures like this, go to http://youtube.serverlesstoronto.org/ or sign up for our upcoming live events at https://www.meetup.com/Serverless-Toronto/events/
Data Warehousing Trends, Best Practices, and Future Outlook - James Serra
Over the last decade, the 3Vs of data - Volume, Velocity & Variety - have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments in terms of either time or resources. But that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by any challenges. From deciding on a service provider to designing the architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure, or still on the fence? In this presentation you will gain insights into current Data Warehousing trends, best practices, and the future outlook. Learn how to build your data warehouse with the help of real-life use cases and a discussion of commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Data Platform - Hortonworks
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and converts all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Microsoft Data Platform - What's included - James Serra
The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products we have, since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together and their proper use cases, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source.
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits 2020) - Michael Rys
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
Big data architectures and the data lake - James Serra
With so many new technologies it can get confusing on the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs bottoms-up approach to analytics, and how you can use a data lake and a RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Moving Your Data Center: Keys to planning a successful data center migration - Data Cave
Moving a data center is something that most IT professionals will have to engage in at some point, and this presentation looks at several areas that can help to make a data center migration as smooth and seamless as possible. We cover areas such as:
-Determining whether to replicate your existing data center infrastructure, or build a new infrastructure as part of the migration project.
-The importance of experience during the logistical side of the data center migration.
-The significance of planning as well as evaluating and fine-tuning your plan on an ongoing basis.
If you will soon be in the process of planning and performing a data center migration, then we encourage you to read up!
Mainframe Modernization with Precisely and Microsoft Azure - Precisely
Today’s businesses are leveraging Microsoft Azure to modernize operations, transform customer experience, and increase profit. However, if the rich data generated by the mainframe applications is missed in the move to the cloud, you miss the mark.
Without the right solutions in place, migrating mainframe data to Microsoft Azure is expensive, time-consuming, and reliant on highly specialized skillsets. Precisely Connect can quickly integrate mainframe data at scale into Microsoft Azure without sacrificing functionality, security, or ease of use.
View this on-demand webinar to hear from Microsoft Azure and Precisely data integration experts. You will:
- Learn how to build highly scalable, reliable data pipelines between the mainframe and Microsoft Azure services
- Understand how to make your Microsoft Azure implementation ready for mainframe
- Dive into case studies of businesses that have successfully included mainframe data in their cloud modernization efforts with Precisely and Microsoft Azure
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ... - Amazon Web Services
In this session, we provide a peek behind the scenes to learn about Amazon ElastiCache's design and architecture. See common design patterns with our Redis and Memcached offerings and how customers have used them for in-memory operations to reduce latency and improve application throughput. During this session, we review ElastiCache best practices, design patterns, and anti-patterns.
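As a concrete illustration of the in-memory patterns mentioned above, here is a minimal cache-aside sketch against an ElastiCache for Redis endpoint using redis-py; the endpoint, key naming, and the database lookup are placeholders, not AWS-provided code.

    # Minimal cache-aside sketch with redis-py; the endpoint, key scheme, and
    # the database lookup below are illustrative placeholders.
    import json
    import redis

    r = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

    def load_product_from_db(product_id):
        # Stand-in for the real relational/NoSQL lookup.
        return {"id": product_id, "name": f"product-{product_id}"}

    def get_product(product_id, ttl_seconds=300):
        key = f"product:{product_id}"
        cached = r.get(key)
        if cached is not None:                      # cache hit: skip the database
            return json.loads(cached)
        record = load_product_from_db(product_id)   # cache miss: fetch and store
        r.setex(key, ttl_seconds, json.dumps(record))
        return record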
The presentation describes types of data pipeline architectures. It contains information about AWS services needed to create data pipelines based on Amazon Web Services. Also, users can find different diagrams of implemented pipelines on AWS.
In this presentation from the AWS Lab at Cloud Expo Europe 2014 you will find details of the six patterns that Enterprise organisations typically follow when adopting Amazon Web Services, as well as a summary of the licensing options available for running enterprise applications on Amazon Web Services.
Big Data: Architecture and Performance Considerations in Logical Data Lakes - Denodo
This presentation explains in detail what a Data Lake Architecture looks like, how data virtualization fits into the Logical Data Lake, and goes over some performance tips. Also it includes an example demonstrating this model's performance.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/9Jwfu6.
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Training - Databricks
This talk presents how we accelerated deep learning processing from preprocessing to inference and training on Apache Spark at SK Telecom. At SK Telecom, we have half of the Korean population as our customers. To support them, we have 400,000 cell towers, which generate logs with geospatial tags.
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ... - Chester Chen
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network, etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLLib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed Pagerank. Beyond rooflining, we believe there are great opportunities from deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism, and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C... - Odinot Stanislas
After a short introduction to distributed storage and a description of Ceph, Jian Zhang presents several interesting benchmarks in this deck: sequential tests, random tests, and above all a comparison of results before and after optimization. The configuration parameters adjusted and the optimizations applied (large page numbers, OMAP data on a separate disk, ...) deliver at least a 2x performance gain.
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Data ... - Databricks
The performance of modern Big Data frameworks, e.g. Spark, depends greatly on high-speed storage and shuffling, which impose a significant memory burden on production data centers. In many production situations, persistence- and shuffle-intensive applications can suffer a major performance loss due to lack of memory. Thus, the common practice is usually to over-allocate the memory assigned to the data workers for production applications, which in turn reduces overall resource utilization. One efficient way to address the dilemma between the performance and cost efficiency of Big Data applications is through data center computing resource disaggregation. This paper proposes and implements a system that incorporates the Apache Spark Big Data framework with a novel in-memory distributed file system to achieve memory disaggregation for data persistence and shuffling. We address the challenge of optimizing performance at a low cost by co-designing the proposed in-memory distributed file system with large-volume DIMM-based persistent memory (PMEM) and RDMA technology. The disaggregation design allows each part of the system to be scaled independently, which is particularly suitable for cloud deployments. The proposed system is evaluated in a production-level cluster using real enterprise-level Spark production applications. The results of an empirical evaluation show that the system can achieve up to a 3.5-fold performance improvement for shuffle-intensive applications with the same amount of memory, compared to the default Spark setup. Moreover, by leveraging PMEM, we demonstrate that our system can effectively increase the memory capacity of the computing cluster by 66.5% at an affordable cost, with a reasonable execution-time overhead with respect to using local DRAM only.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
Scale confidently. From laptop to lots of nodes to multi-cluster, multi-use case deployments, Elastic experts are sharing best practices to master and pitfalls to avoid when it comes to scaling Elasticsearch.
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physics ... - Databricks
The physicists at CERN are increasingly turning to Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics with increased interactivity. The physics data itself is stored in CERN’s mass storage system, EOS, and CERN’s IT department runs an on-premises private cloud based on OpenStack as a way to provide on-demand compute resources to physicists. This presents both an opportunity and challenges for the Big Data team at CERN in providing elastic, scalable, reliable Spark-as-a-service on OpenStack.
The talk focuses on the design choices made and challenges faced while developing Spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters. In addition, the service tooling simplifies submitting applications on behalf of users, mounting user-specified ConfigMaps, copying application logs to S3 buckets for troubleshooting, performance analysis and accounting of Spark applications, and support for stateful Spark Streaming applications. We will also share results from running large-scale sustained workloads over terabytes of physics data.
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I... - Databricks
In this session, the speakers will discuss their experiences porting Apache Spark to the Cray XC family of supercomputers. One scalability bottleneck is in handling the global file system present in all large-scale HPC installations. Using two techniques (file open pooling, and mounting the Spark file hierarchy in a specific manner), they were able to improve scalability from O(100) cores to O(10,000) cores. This is the first result at such a large scale on HPC systems, and it had a transformative impact on research, enabling their colleagues to run on 50,000 cores.
With this baseline performance fixed, they will then discuss the impact of the storage hierarchy and of the network on Spark performance. They will contrast a Cray system with two levels of storage with a “data intensive” system with fast local SSDs. The Cray contains a back-end global file system and a mid-tier fast SSD storage. One conclusion is that local SSDs are not needed for good performance on a very broad workload, including spark-perf, TeraSort, genomics, etc.
They will also provide a detailed analysis of the impact of latency of file and network I/O operations on Spark scalability. This analysis is very useful to both system procurements and Spark core developers. By examining the mean/median value in conjunction with variability, one can infer the expected scalability on a given system. For example, the Cray mid-tier storage has been marketed as the magic bullet for data intensive applications. Initially, it did improve scalability and end-to-end performance. After understanding and eliminating variability in I/O operations, they were able to outperform any configurations involving mid-tier storage by using the back-end file system directly. They will also discuss the impact of network performance and contrast results on the Cray Aries HPC network with results on InfiniBand.
Similar to Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction with Geospatial Visualization
Data Lakehouse Symposium | Day 1 | Part 1 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop - Databricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
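As a rough illustration of the kind of validation described above (not Zillow's actual platform), the sketch below expresses a few expectations as plain PySpark checks that a pipeline could run before publishing a dataset; the dataset path, columns, and thresholds are assumptions.

    # Illustrative PySpark data quality checks; path, columns, and thresholds
    # are assumptions, not Zillow's platform code.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dq-checks").getOrCreate()
    df = spark.read.parquet("/data/listings")            # hypothetical dataset

    total = df.count()
    checks = {
        "price_not_null": df.filter(F.col("price").isNull()).count() == 0,
        "price_positive": df.filter(F.col("price") <= 0).count() == 0,
        # Tolerate up to 1% missing zip codes instead of failing outright.
        "zipcode_mostly_present":
            df.filter(F.col("zipcode").isNull()).count() <= 0.01 * total,
    }

    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")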
Learn to Use Databricks for Data Science - Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring - Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix - Databricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
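Stitch Fix's interface is internal, so the following is only a hypothetical sketch of the "write a simple function, register it with the platform" style the abstract describes; the decorator, registry, and model logic are invented for illustration and are not Stitch Fix's actual API.

    # Hypothetical function-first model interface; register_model and
    # MODEL_REGISTRY are invented names, not Stitch Fix's actual API.
    from typing import Callable, Dict, List

    MODEL_REGISTRY: Dict[str, Callable] = {}

    def register_model(name: str):
        """Register a plain Python function as a deployable model."""
        def wrap(fn):
            MODEL_REGISTRY[name] = fn
            return fn
        return wrap

    @register_model("size_recommender")
    def predict(features: Dict[str, float]) -> List[str]:
        # Data scientists write ordinary code; a platform like the one described
        # would handle online deployment, batch runs on Spark, and metrics.
        score = 0.7 * features.get("height_cm", 0) + 0.3 * features.get("weight_kg", 0)
        return ["M"] if score < 140 else ["L"]

    print(MODEL_REGISTRY["size_recommender"]({"height_cm": 170, "weight_kg": 65}))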
Stage Level Scheduling Improving Big Data and AI Integration - Databricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
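A small sketch of the stage-level scheduling API described above, as exposed in PySpark 3.1: the ETL stage runs on default resources, then a GPU-backed ResourceProfile is attached for the downstream stage. The GPU discovery script path, resource amounts, and the per-partition training stub are placeholders, and dynamic allocation must be enabled on YARN, Kubernetes, or standalone.

    # Sketch of Spark 3.1 stage-level scheduling: request GPU executors only
    # for the stage that needs them. Paths, amounts, and train_partition are
    # placeholders.
    from pyspark.sql import SparkSession
    from pyspark.resource import (ExecutorResourceRequests,
                                  TaskResourceRequests,
                                  ResourceProfileBuilder)

    spark = SparkSession.builder.appName("etl-then-dl").getOrCreate()

    def train_partition(rows):
        # Stand-in for a per-partition training/inference step on the GPU.
        yield sum(1 for _ in rows)

    # Plain ETL on the default (CPU) resource profile.
    prepared = spark.read.parquet("/data/raw").selectExpr("features", "label").rdd

    # Ask for GPU-backed containers just for the downstream stage.
    exec_reqs = (ExecutorResourceRequests()
                 .cores(4)
                 .memory("16g")
                 .resource("gpu", 1, discoveryScript="/opt/spark/getGpus.sh"))
    task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    profile = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

    trained_counts = prepared.withResources(profile).mapPartitions(train_partition)
    print(trained_counts.collect())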
Simplify Data Conversion from Spark to TensorFlow and PyTorch - Databricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess it using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may run into a problem: how can I convert my Spark DataFrame into a format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
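A short sketch of the converter flow described above, based on the Petastorm Spark Dataset Converter API; the cache directory, dataset path, and column names are placeholders.

    # Sketch of Petastorm's Spark Dataset Converter: Spark DataFrame in,
    # tf.data.Dataset out. Paths and column names are placeholders.
    from pyspark.sql import SparkSession
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    spark = SparkSession.builder.appName("spark-to-tf").getOrCreate()
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
                   "file:///tmp/petastorm_cache")     # intermediate cache location

    df = spark.read.parquet("/data/training").select("features", "label")
    converter = make_spark_converter(df)

    with converter.make_tf_dataset(batch_size=64) as dataset:
        # `dataset` yields named tuples matching the DataFrame columns and can
        # be passed straight to model.fit(); here we just peek at one batch.
        for batch in dataset.take(1):
            print(batch.label.shape)

    converter.delete()   # remove the cached intermediate copy when done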
Scaling your Data Pipelines with Apache Spark on Kubernetes - Databricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
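For orientation, here is a minimal sketch of pointing a PySpark session at a Kubernetes cluster in client mode with dynamic allocation enabled; the API server URL, namespace, image, and service account are placeholders, and this is not the speakers' GKE/Composer setup.

    # Minimal PySpark-on-Kubernetes sketch (client mode); all cluster-specific
    # values below are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pipelines-on-k8s")
             .master("k8s://https://<api-server-host>:443")
             .config("spark.kubernetes.namespace", "spark-jobs")
             .config("spark.kubernetes.container.image", "gcr.io/my-project/spark-py:3.1.2")
             .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
             .config("spark.executor.instances", "4")
             # Autoscaling of executors via dynamic allocation + shuffle tracking.
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
             .getOrCreate())

    spark.range(1_000_000).selectExpr("sum(id)").show()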
Scaling and Unifying SciKit Learn and Apache Spark Pipelines - Databricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however this is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
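The speaker's pipeline abstraction is his own library, so the sketch below only illustrates the underlying idea of mapping independent scikit-learn "fit" work onto Ray tasks for scale-out parallelism; the dataset and hyperparameter choices are arbitrary.

    # Generic illustration of scaling out scikit-learn fits with Ray tasks;
    # not the speaker's pipeline library.
    import ray
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    ray.init()
    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

    @ray.remote
    def fit_and_score(c):
        # Each hyperparameter setting is fitted as an independent Ray task.
        model = LogisticRegression(C=c, max_iter=200).fit(X, y)
        return c, model.score(X, y)

    results = ray.get([fit_and_score.remote(c) for c in (0.01, 0.1, 1.0, 10.0)])
    print(max(results, key=lambda r: r[1]))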
Sawtooth Windows for Feature Aggregations - Databricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups”, operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark (a sketch of both niches follows the outline below).
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query it N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
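A rough redis-py sketch of the two niches outlined above; the Redis endpoint, key names, and the query-handling details are illustrative, not Adobe's production code.

    # Illustrative redis-py sketches for the two niches above; host, key names,
    # and query handling are placeholders.
    import json
    import redis

    # Niche 1: a long-running Spark driver loads a table once, then serves
    # ad-hoc queries dispatched through a Redis list used as a queue.
    def poll_query_queue(spark, table_df):
        r = redis.Redis(host="redis-host", port=6379)
        table_df.createOrReplaceTempView("loaded_table")
        while True:
            _, payload = r.blpop("spark:query_queue")   # blocks until a job arrives
            request = json.loads(payload)
            rows = spark.sql(request["sql"]).limit(100).toJSON().collect()
            r.rpush(f"spark:results:{request['id']}", json.dumps(rows))

    # Niche 2: per-partition counters kept in a Redis hash, updated from the
    # executors with a pipeline to cut round trips (use with mapPartitions).
    def count_statuses(rows):
        r = redis.Redis(host="redis-host", port=6379)   # connect on the executor
        counts = {}
        for row in rows:
            counts[row["status"]] = counts.get(row["status"], 0) + 1
        pipe = r.pipeline()
        for status, n in counts.items():
            pipe.hincrby("job:status_counts", status, n)
        pipe.execute()
        yield sum(counts.values())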
Re-imagine Data Monitoring with whylogs and Spark - Databricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
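For a flavor of the profiling step, here is a minimal sketch using whylogs' pandas-level API on a driver-side sample; the distributed Spark integration the talk covers profiles the full dataset across executors instead, and the sample path is a placeholder.

    # Minimal whylogs profiling sketch on a pandas sample; the Spark-native
    # integration described in the talk profiles data across executors instead.
    import pandas as pd
    import whylogs as why

    df = pd.read_parquet("/data/events_sample.parquet")   # hypothetical sample

    results = why.log(df)              # build a statistical profile of the data
    profile_view = results.view()
    print(profile_view.to_pandas().head())    # per-column counts, types, stats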
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
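To make the "(ii) operator transformations" idea concrete, here is a small, hedged sketch (not Raven's code) of turning a toy decision tree into an equivalent SQL CASE expression so a SQL engine can evaluate the model inline; the tree, thresholds, and column names are made up for illustration.

sealed trait Node
case class Leaf(score: Double) extends Node
case class Split(column: String, threshold: Double, left: Node, right: Node) extends Node

// Render a tree as a nested CASE WHEN expression.
def toSql(node: Node): String = node match {
  case Leaf(score)                  => score.toString
  case Split(col, thr, left, right) =>
    s"CASE WHEN $col <= $thr THEN ${toSql(left)} ELSE ${toSql(right)} END"
}

val tree = Split("rsrp", -95.0, Leaf(1.0), Split("cqi", 7.0, Leaf(0.7), Leaf(0.1)))
toSql(tree)
// CASE WHEN rsrp <= -95.0 THEN 1.0 ELSE CASE WHEN cqi <= 7.0 THEN 0.7 ELSE 0.1 END END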
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels, like email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
• What are we storing?
• Multi-source, multi-channel problem
• Data representation and nested schema evolution
• Performance trade-offs with various formats
• Anti-patterns used (String FTW)
• Data manipulation using UDFs
• Writer worries and how to wipe them away (staging tables FTW)
• Data lake replication lag tracking
• Performance time!
State of Artificial Intelligence Report 2023 (kuntobimo2016)
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Global Situational Awareness of A.I. and Where It's Headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
2. Hongchan Roh, Dooyoung Hwang, SK Telecom
Spark AI Use Case in Telco: Network Quality Analysis and Prediction with Geospatial Visualization
#UnifiedDataAnalytics #SparkAISummit
3. Network Quality Visualization Demo
YouTube demo video link: https://youtu.be/HpDkF3CxEow
▸ Demo shows
- Visualization of the RF quality of cell towers
- Height: connection count
- Color: RF quality (good to bad)
▸ Data source
- 300,000 cell tower network device logs
▸ Resources
- 5 CPU nodes with Intel Xeon Gold 6240
Note: this does not exactly reflect real network quality; we generated synthetic data from the real data.
5. Network Quality Analysis
SK Telecom: the largest telecommunications provider in South Korea
• 27 million subscribers
• 300,000 cells
Target data: radio cell tower logs
• time-series data with timestamp tags, generated every 10 seconds
• geospatial data with geographical coordinates (latitude and longitude) derived from each cell tower's location
6. Network Quality Analysis
Data ingestion requirements
• Ingestion of 1.4 million records/sec (500-byte records, 200~500 columns)
• 120 billion records/day, 60 TB/day
• Data retention period: 7~30 days
Query requirements
• Web dashboard and ad-hoc queries: response within 3 seconds for specific time and region (cell) predicates, for multiple concurrent sessions
• Daily batch queries: response within hours for long query pipelines with heavy operations such as joins and aggregations
7. Problems of the Legacy Architecture
Legacy architecture: Spark with HDFS (2014~2015)
• Tried to make partitions reflecting common query predicates (time, region)
• Used SSDs as a block cache to accelerate I/O performance
• The Hadoop NameNode often crashed with millions of partitions
• Neither ingestion nor query performance could satisfy the requirements
[Legacy stack diagram: Spark with an SSD cache over HDFS, on Linux with CPU/DRAM and HDD storage]
8. New In-memory Datastore for Spark (FlashBase)
• Tried to design a new data store for Spark (FlashBase) that supports far more partitions
[New architecture diagram: SQL queries (web, Jupyter, JDBC) go through Spark-SQL, and data loading (file, HTTP, Kafka) goes through a data loader, into a forked DRAM store tiered with a customized flash store]
New in-memory (DRAM/SSD) datastore for Spark (2016)
• Best open-source candidates to assemble: Spark SQL for the query engine, Redis (forked) for the DRAM key-value store, and RocksDB (customized) for the SSD key-value store
• SSDs as the main storage devices, for small-sized parallel I/O with short latencies
• If we assemble these with some glue logic, can it become the new datastore for Spark? (A conceptual read-path sketch of the DRAM/SSD tiering follows below.)
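As a rough, hedged illustration of that glue logic (not FlashBase code), the read path of such DRAM/SSD tiering can be sketched as a read-through over two key-value stores; all names are placeholders.

trait KeyValueStore { def get(key: String): Option[Array[Byte]] }

// Read-through tiering: hot partitions are served from the DRAM (Redis-style) store,
// colder partitions fall back to the SSD (RocksDB-style) store.
class TieredStore(dram: KeyValueStore, flash: KeyValueStore) extends KeyValueStore {
  override def get(key: String): Option[Array[Byte]] =
    dram.get(key).orElse(flash.get(key))
}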
10. Problem 1 - Redis Space Amplification Problem
Redis occupied DRAM space 4 times the size of the input data
• At least 72 bytes of data-structure overhead per entry
- 24 B for the dictEntry
- 24 B for the key redisObject and 24 B for the value redisObject
• 85 bytes for a 12-byte key and a 1-byte column value
FlashBase reduced DRAM usage to 1/4 of original Redis
• A custom data structure called the column value array (a conceptual sketch follows below)
• Stores data column-wise
• Gathers column values from different rows (C++ vector style)
• The data layout is similar to Apache Arrow
DRAM is still expensive!
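A conceptual sketch of the column value array idea (the real implementation is C code inside the forked Redis; these names are illustrative): instead of paying per-value object overhead, values of the same column across many rows are packed into one contiguous array, similar to Arrow's columnar layout.

case class ColumnValueArray(name: String, values: Array[Double])

case class RowBatch(rows: Seq[Map[String, Double]]) {
  // Pivot row-wise records into one array per column; the per-value overhead
  // (dictEntry + redisObject headers) is paid once per column array instead of per value.
  def toColumnar: Seq[ColumnValueArray] =
    rows.flatMap(_.keys).distinct.map { col =>
      ColumnValueArray(col, rows.map(_.getOrElse(col, Double.NaN)).toArray)
    }
}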
11. Problem 2 - RocksDB Write Amplification Problem
RocksDB wrote 40 times the input data to SSDs
• RocksDB consists of multi-level indexes with sorted key-values in each level
• Key-values are migrated from the top level to the next level by the compaction algorithm
• Key-values are therefore written multiple times (once per level), over update regions that grow by the level multiplier
FlashBase reduced writes to 1/10 of original RocksDB
• Customized the RocksDB compaction job to avoid next-level updates as much as possible
• 5 times better ingestion performance, and 1/10 the TCO for SSD replacement
The more writes, the sooner SSDs fail!
• SSDs have a limit on drive writes per day (DWPD)
• SSD faults cause service downtime and additional TCO for replacing SSDs
12. Query Acceleration 1 - Extreme Partitioning
A partition combination for network quality analysis (a key-composition sketch follows below)
• 300K (cell towers) x 100K (time slots) = 30B (total partitions)
[Partition grid: cell tower partitions (wvcv3, wvcyw, wvfj6, …, wyfb1, wyf9w) x time partitions (201804221100, 201804221105, …, 201904221110)]
Node spec: E5 2680 v4 CPU, 256 GB DRAM, 20 TB SSDs
Hadoop File System
• Up to 1 billion partitions in a single cluster
• 300 GB of DRAM for 1 billion partitions (roughly 150 bytes of NameNode memory per file and per block)
FlashBase
• Up to 2 billion partitions in a single node
• Needs 15 nodes to store 30 billion partitions
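A hedged sketch of the kind of composite partition key this extreme partitioning implies: one partition per (cell tower, 5-minute time slot). The function and key format are illustrative, not FlashBase APIs.

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val slotFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmm")

// Truncate the event time to its 5-minute slot so that all records of one tower
// in the same slot land in the same partition.
def partitionKey(cellTowerId: String, eventTime: LocalDateTime): String = {
  val slot = eventTime.withSecond(0).withMinute(eventTime.getMinute / 5 * 5)
  s"$cellTowerId:${slot.format(slotFormat)}"
}

// 300,000 towers x ~100,000 five-minute slots (roughly a year) ≈ 30 billion partitions.
partitionKey("wvcv3", LocalDateTime.of(2018, 4, 22, 11, 3))   // "wvcv3:201804221100"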
13. Query Acceleration 2 - Filter Push Down
• Custom relation definition to register Redis as a Spark data source using Data Source API v1 (Redis/RocksDB Relation -> R2Relation)
• Filter/projection pushdown to the Redis/RocksDB store using PrunedScan and PrunedFilteredScan
package com.skt.spark.r2

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.StructType

// Relation registering Redis/RocksDB (R2) as a Spark data source (Data Source API v1).
// Method bodies are omitted on the slide.
case class R2Relation(
    identifier: String,
    schema: StructType
  )(@transient val sqlContext: SQLContext)
  extends BaseRelation
  with RedisUtils with Configurable
  with TableScan with PrunedScan with PrunedFilteredScan with InsertableRelation
  with Logging {

  def buildTable(): RedisTable                                      // resolve Redis table metadata
  override def buildScan(requiredColumns: Array[String]): RDD[Row]  // scan with projection pushdown
  def insert(rdd: RDD[Row]): Unit                                   // ingestion path
}
14. Query Acceleration 2 - Filter Push Down
Partition filtering
• Each Redis process keeps only the satisfying partitions, using the filter predicates pushed down from Spark
• Pruned scans are requested only for the satisfying partitions
Data filtering in the pruned scan
• Each pruned-scan command examines the actual column values stored in the requested partition
• Only the column values satisfying the pushed-down filter predicates are returned
Spark Data Source filter pushdown (a filter-splitting sketch follows below)
• And, Or, Not, Like, Limit
• EQ, GT, GTE, LT, LTE, IN, IsNull, IsNotNull, EqualNullSafe
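A hedged sketch of that filter-handling split (the column names are illustrative, not the FlashBase schema): the filters Spark pushes down via PrunedFilteredScan are divided into predicates over the partition-key columns, used for partition pruning, and the rest, which are evaluated inside the pruned-scan commands.

import org.apache.spark.sql.sources._

def splitFilters(filters: Array[Filter]): (Array[Filter], Array[Filter]) = {
  val partitionColumns = Set("event_time", "cell_tower_id")   // illustrative partition-key columns
  def onPartitionColumns(f: Filter): Boolean = f match {
    case EqualTo(a, _)            => partitionColumns(a)
    case In(a, _)                 => partitionColumns(a)
    case GreaterThan(a, _)        => partitionColumns(a)
    case GreaterThanOrEqual(a, _) => partitionColumns(a)
    case LessThan(a, _)           => partitionColumns(a)
    case LessThanOrEqual(a, _)    => partitionColumns(a)
    case And(l, r)                => onPartitionColumns(l) && onPartitionColumns(r)
    case _                        => false
  }
  // _1: used to prune partitions; _2: evaluated against column values in the pruned scan.
  filters.partition(onPartitionColumns)
}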
15. Network Quality Analysis Example
Network quality analysis query for one day and a single cell tower:

select * from ue_rf_sum
where event_time between '201910070000' and '201910080000'
  and cell_tower_id = 'snjJlAF5W' and rsrp < -85;

• 0.142 trillion (142 billion) records in the ue_rf_sum table¹ (7 days of data, 42 TB)
• 14,829 satisfying records
• Partition filtering: 1/10080 with the time predicate, 1/(10080 x 30000) with the time and cell tower predicates
• Spark with HDFS: about half an hour (HDFS cluster: 20 nodes, E5 2650 v4 CPU, 256 GB DRAM, 24 TB HDDs)
• Spark with FlashBase: less than 1 second (FlashBase cluster: 16 nodes, E5 2680 v4 CPU, 256 GB DRAM, 20 TB SSDs)
¹ ue_rf_sum = user_equipment_radio_frequency_summary
16. Ingestion Performance and Other Features
Node spec: E5 2680 v4 CPU, 256 GB DRAM, 20 TB SSDs
• Ingestion performance: 500,000 records/sec/node
• In-memory datastore: DRAM only, or DRAM-to-SSD tiering
• Massively parallel processing: 100 Redis processes per node
• Extreme partitioning: up to 2 billion partitions per node
• Filter acceleration: using fine-grained partitions and pushed-down filters
• Column store: column store by default (row-store option)
• Column value transformation: defined by Java grammar in schema table properties
• Compression: gzip-level compression ratio with LZ4-level speed
• Vector processing: filter and aggregation acceleration (SIMD, AVX)
• Etc.: recovery, replication, scale-out
18. Network OSS Deploy – Web Dashboard (2017)
• Deployed to the network operating system in 2017
• Web dashboard queries with time and region predicates
• Wired/wireless quality analysis
• Mobile device quality analysis: displays mobile device quality data per hierarchy
[Screenshot: Mobile Device Quality dashboard]
19. Network OSS Deploy – Batch Queries (2017)
• Batch queries from Jupyter via R Hive libraries
• Coverage-hole risk analysis for each cell
21. Why Geospatial Visualization?
• Geospatial analysis
- Gathers, manipulates, and displays geographic information system (GIS) data
- Requires heavy aggregate computations
→ A good case to demonstrate real-time big data processing
- Some companies demonstrated geospatial analysis to show the advantages of GPU databases over CPU databases
→ We tried it with Spark & FlashBase on CPUs
22. Architecture of Geospatial Visualization
[Architecture diagram: a data loader ingests from file/HTTP/Kafka into FlashBase (forked DRAM store tiered with a customized flash store); a Spark job issues geospatial object scan commands, builds vector tiles, and serves them over an HTTP API for map rendering]
• Front-end: MapBox JS
- MapBox uses VectorTiles to render overlay layers
- VectorTiles are fetched via the HTTP API
23. Architecture of Geospatial Visualization
• Back-end web server
- Builds VectorTiles with Spark jobs
- Apache Livy: manipulates multiple Spark contexts simultaneously
24. Architecture of Geospatial Visualization
• Spark cluster & GeoSpark
- GeoSpark: supports geospatial UDFs and predicates
25. Architecture of Geospatial Visualization
• FlashBase
- FlashBase stores objects with latitude and longitude; each record is partitioned by its GeoHash
26. Architecture of Geospatial Visualization
Problem
• Latency issue with the HTTP API: sluggish map loading!
• Building a VectorTile requires heavy computation and shuffling for aggregation: an aggregation grouped by pixel over a 256 x 256 pixel tile
▸ If the web client shows 20 tiles → 1.3 million (256 x 256 x 20) aggregation operations are required → heavy computation and shuffle writes in Spark
▸ If the user scrolls the map → all tiles must be recalculated
27. Architecture of Geospatial Visualization
• Performance optimization
1. Spark pushes the aggregation down to FlashBase, and FlashBase sends the aggregated results back to Spark
→ reduces Spark's shuffle write size and computation to 1/10
2. FlashBase accelerates aggregation with vector processing via Intel AVX-512 (Intel Math Kernel Library)
→ 2x faster aggregation
→ 20 times faster than original GeoSpark
[Flow: (1) push down the aggregation, (2) accelerate the aggregation via AVX-512, (3) return the aggregated results]
28. Optimization Detail
The query that builds a VectorTile:

SELECT * FROM pcell WHERE ST_VectorTileAggr('7,109,49', 'AVG')

1. ST_VectorTileAggr(arg1, arg2)
- A custom predicate that carries the aggregation information
- arg1: zoom level of the map and tile position (x, y) on the globe
- arg2: aggregation type (SUM or AVG)
2. Define and apply a custom optimization rule (a registration sketch follows below)
- Applied during the optimization phase of the query plan
- Parses the aggregation information from the predicate and pushes it down to FlashBase
3. Aggregation in FlashBase
- Computation parallelized by the FlashBase process count (generally 100~200 processes per node)
- Each FlashBase process accelerates aggregation using Intel MKL
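For step 2, here is a hedged sketch of how such a rule can be attached to Spark's optimizer; the rule body is left as an identity placeholder, since the real FlashBase rule (which parses ST_VectorTileAggr and swaps in a pushed-down aggregation scan) is not public.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object VectorTileAggrRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    // Real rule: match the Filter node carrying the ST_VectorTileAggr marker, parse the
    // zoom/tile/aggregation-type arguments, and replace the subtree with a scan whose
    // per-pixel aggregation is pushed down to the store. Left as identity here.
    case node => node
  }
}

val spark = SparkSession.builder().appName("vector-tile-rule-sketch").getOrCreate()
spark.experimental.extraOptimizations ++= Seq(VectorTileAggrRule)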
30. Introduction of Network Quality Prediction
• Predict network quality indicators (CQI, RSRP, RSRQ, SINR, …) for anomaly detection and real-time management
• Goal: unify geospatial visualization & network prediction on Spark
* CQI: Channel Quality Indicator
* RSRP: Reference Signal Received Power
* RSRQ: Reference Signal Received Quality
* SINR: Signal-to-Interference-plus-Noise Ratio
31. We focused on
1. Improving the deep learning model for forecasting time-series data
2. Improving the architecture and data pipeline for training and inference
32. Model for Network Quality Prediction - RNN
• An RNN-type model (Seq2Seq) is the common solution for time-series prediction, but it is not suitable for our network quality prediction.
[Chart: Seq2Seq actual vs. forecast; error metric: MAE, score = error x 100]
It cannot predict sudden changes!
33. Memory augmented model
[Model diagram: memory slots 1-7 (one week of data) and the current window each pass through an encoder (Encoder1 for memory, Encoder2 for current); an attention layer over the encoded memories is concatenated with the current encoding and fed to an FCNN that produces the final prediction ŷ_{t+1}]
• Current: recent 50 minutes of data with a 5-minute period
• Memory: the previous 7 days of historical data, each with the same time band as the current window and the target
• Target: network quality after 5 minutes
• Encoder: 1-NN (autoregressive term)
- Encoder1 (memory): h_t = c + w_1 y_{t-week+1} + … + w_11 y_{t-week+11}
- Encoder2 (current): h'_t = c' + w'_1 y_{t-1} + … + w'_10 y_{t-10}
35. Memory augmented model - Test result
[Chart: memory-augmented model actual vs. forecast; error metric: MAE, score = error x 100]
Improved predictions for sudden changes!
36. Training & Inference Architecture - Legacy
[Legacy pipeline: FlashBase/Spark-SQL → CSV export → preprocessing → training & inference]
1. Export data to CSV from the Spark Thrift Server using a Hive client
2. Preprocess with pandas
3. Train or infer with TensorFlow on CPU
37. Training & Inference Architecture - Legacy
Problems
1. No in-memory pipeline between the data source and the deep learning layer
2. Preprocessing, inference, and training are performed on a single server
38. Training & Inference Architecture - New
• Build an in-memory pipeline between FlashBase and Intel Analytics Zoo (a conceptual sketch follows below)
• The data layer and the inference & training layer are integrated into the same Spark cluster and share the same Spark session
• Source code: https://github.com/mnms/ARMemNet-BigDL
• Intel Analytics Zoo: used to plug the TF model into the Spark pipeline seamlessly
• Intel BigDL: inference & training engine
• Inference & training can be distributed across the Spark cluster
[Pipeline: (1) preprocess → (2) RDD of tensors → (3) TF model code → DL training & inference, with SIMD acceleration on the executors]
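A hedged, conceptual sketch of the in-memory pipeline idea only (not the ARMemNet-BigDL or Analytics Zoo code): the same Spark session reads features from the datastore, preprocesses them, and runs inference on the executors, so no CSV ever leaves the cluster. DummyModel, loadModel, and the table and column names are placeholders.

import org.apache.spark.sql.SparkSession

case class DummyModel() {                                   // stands in for the BigDL/Analytics Zoo engine
  def predict(x: Array[Double]): Double = x.sum / x.length
}
def loadModel(): DummyModel = DummyModel()

val spark = SparkSession.builder().appName("inference-pipeline-sketch").getOrCreate()

// (1) Data layer: features come straight from the Spark datasource, no export step.
val features = spark.table("ue_rf_sum")
  .selectExpr("cell_tower_id", "cqi", "rsrp", "rsrq", "sinr")

// (2)-(3) Preprocess and infer inside the same Spark job, distributed over the executors.
val predictions = features.rdd.mapPartitions { rows =>
  val model = loadModel()                                   // one model instance per partition
  rows.map { r =>
    val x = Array(r.getDouble(1), r.getDouble(2), r.getDouble(3), r.getDouble(4))
    (r.getString(0), model.predict(x))
  }
}
predictions.count()                                         // triggers the distributed inference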
39. Comparison between the two architectures
• Inference results only for now; distributed training is planned for later
• Test workload
- 7,722,912 rows = 80,447 cell towers x 8 days x 12 rows (1 hour of data with a 5-minute period)
- 8 network indicators per row
→ Input tensor (80,447 x 696) = current input (80,447 x 10 x 8) + memory input (80,447 x 77 x 8)

                         Pandas + TF                    Spark + Analytics Zoo (local)   Spark + Analytics Zoo (3-node YARN)
Data export              2.3 s                          N/A                             N/A
Pre-processing           71.96 s                        2.56 s                          1.43 s
Deep learning inference  1.06 s (CPU) / 0.63 s (GPU)    0.68 s                          0.18 s

• Pandas + TF: performed on a single node
• Spark + Analytics Zoo: data and computation distributed over 50 partitions; preprocessing and inference executed in a single Spark job
• End to end, the 3-node pipeline is about 45x faster
※ CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
※ GPU: NVIDIA K80
40. Comparison between TF CPU and Analytics Zoo
• Comparing inference time
• Environment
- CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
- Cores: 36

Batch size   Elapsed time (TF-CPU)   Elapsed time (Analytics Zoo Scala on Spark)
32           14.28                   2.58
64            8.39                   1.45
128           4.87                   0.68
256           2.95                   0.43
512           2.03                   0.40
1024          2.01                   0.40
2048          1.44                   0.43

→ 3~5 times faster
41. Appx. Memory Problem of the Spark Driver
• The collect() function of Dataset sometimes throws an OOM while decompressing and deserializing the result
→ the job fails and the Spark driver is killed
• Spark provides the 'spark.driver.maxResultSize' config for this issue, but
- it only reflects the compressed size
- the actual result size can be 5x~20x the compressed size
- it is difficult to tune the config so that the driver is protected from OOM
[Diagram: in the result stage, executors send compressed result binaries to the driver; the driver decompresses and deserializes them into a full Array[T] and hits an OutOfMemoryException]
42. Appx. Memory Problem of the Spark Driver - Solution
• Define a collectAsSeqView function in Dataset that returns a SeqView of the result (a sketch follows below)
- The SeqView just holds the compressed results and the decompression operations
- The driver decompresses and deserializes on each fetch
- Decompressed and deserialized results become garbage once the cursor moves on
- Only the compressed binaries reside in memory, so the job's memory can be bounded by 'spark.driver.maxResultSize'
→ Completely protects the driver from OOM while collecting results
[Diagram: executors send compressed result binaries; the driver creates a SeqView, attaches the decompress/deserialize operation to the view, and returns SeqView[T]]
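A hedged sketch of the idea behind collectAsSeqView, not the actual patch in SPARK-25224: the driver keeps only the compressed per-task blocks and attaches the decode step to a lazy view, so a block is decompressed and deserialized only when iterated and can be garbage-collected right after. The real code uses Spark's own serializers; plain Java serialization and gzip stand in here.

import java.io.{ByteArrayInputStream, ObjectInputStream}
import java.util.zip.GZIPInputStream

// Decode one compressed block into its rows (assumes blocks were written with an
// ObjectOutputStream wrapped in a GZIPOutputStream).
def decodeBlock[T](block: Array[Byte]): Seq[T] = {
  val in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(block)))
  try in.readObject().asInstanceOf[Seq[T]] finally in.close()
}

// The view is lazy: flatMap is evaluated only while the caller iterates, so at most
// one decoded block is live at a time; only the compressed binaries stay resident.
def collectAsSeqView[T](compressedBlocks: Array[Array[Byte]]): Iterable[T] =
  compressedBlocks.view.flatMap(block => decodeBlock[T](block))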
43. Appx. Memory Problem of the Spark Driver - Patch
• The collectAsSeqView function uses only 10%~20% of the memory of the collect function
• Created a Spark pull request that applies this to the Thrift server
- PR: SPARK-25224 (https://github.com/apache/spark/pull/22219)
- Review in progress
45. Open Discussion
• More partitioning, or indexing with fewer partitions
• Spark Data Source V2 and aggregation pushdown
• Possible new directions for FlashBase in the Spark ecosystem
• An efficient end-to-end data pipeline for big-data-based inference and training
46. How to use Spark with FlashBase
A free binary can be used (not open sourced yet)
• Public cloud: AWS Marketplace AMI (~19.12.31), CloudFormation (~20.3.31)
• On-premise: GitHub page (~20.3)
Contact us first if you want to try FlashBase and get some help
• E-mail: flashbase@sktelecom.com
• Homepage (temporary): https://flashbasedb.github.io