Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
Spark SQL is one of the most popular components in big data warehouse for SQL queries in batch mode, and it allows user to process data from various data sources in a highly efficient way. However, Spark SQL is a general purpose SQL engine and not well designed for ad hoc queries. Intel invented an Apache Spark data source plugin called Spinach for fulfilling such requirements, by leveraging user-customized indices and fine-grained data cache mechanisms.
To be more specific, Spinach defines a new Parquet-like data storage format, offering a fine-grained hierarchical cache mechanism in the unit of “Fiber” in memory. Even existing Parquet or ORC data files can be loaded using corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL, to allow users to define the customized indices based on relation. Currently, B+ tree and bloom filter are the first two types of indices supported. Last but not least, since Spinach resides in the process of Spark executor, there’s no extra effort in deployment. All you need to do is to pick Spinach from Spark packages when launching the Spark SQL.
sing corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL, to allow user to define the customized indices based on relation. Currently B+ tree and bloom filter are the first 2 types of index we’ve supported. Last but not least, since Spinach resides in the process of Spark executor, there’s no extra effort in deployment, all we need to do is to pick Spinach from Spark packages when launch the Spark SQL.
Spinach has been imported in Baidu’s production environment since Q4 2016. It helps several teams migrating their regular data analysis tasks from Hive or MR jobs to ad-hoc queries. In Baidu search ads system FengChao, data engineers analyze advertising effectiveness based on several TBs data of display and click logs every day. Spinach brings a 5x boost compared to original Spark SQL (version 2.1), especially in the scenario of complex search and large data volume. It optimizes the average search cost from minutes to seconds, while brings only 3% data size increase for adding a single index.
From R Script to Production Using rsparkling with Navdeep GillDatabricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraDatabricks
Data integration is a really difficult problem. We know this because 80% of the time in every project is spent getting the data you want the way you want it. We know this because this problem remains challenging despite 40 years of attempts to solve it. All we want is a service that will be reliable, handle all kinds of data and integrate with all kinds of systems, be easy to manage and scale as our systems grow. Oh, and it should be super low latency too. Is it too much to ask?
In this presentation, we’ll discuss the basic challenges of data integration and introduce few design and architecture patterns that are used to tackle these challenges. We will then explore how these patterns can be implemented using Apache Kafka. Difficult problems are difficult and we offer no silver bullets, but we will share pragmatic solutions that helped many organizations build fast, scalable and manageable data pipelines.
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. At this scale, output committers that create extra copies or can’t handle task failures are no longer practical. This talk will explain the problems that are caused by the available committers when writing to S3, and show how Netflix solved the committer problem.
In this session, you’ll learn:
– Some background about Spark at Netflix
– About output committers, and how both Spark and Hadoop handle failures
– How HDFS and S3 differ, and why HDFS committers don’t work well
– A new output committer that uses the S3 multi-part upload API
– How you can use this new committer in your Spark applications to avoid duplicating data
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
Spark SQL is one of the most popular components in big data warehouse for SQL queries in batch mode, and it allows user to process data from various data sources in a highly efficient way. However, Spark SQL is a general purpose SQL engine and not well designed for ad hoc queries. Intel invented an Apache Spark data source plugin called Spinach for fulfilling such requirements, by leveraging user-customized indices and fine-grained data cache mechanisms.
To be more specific, Spinach defines a new Parquet-like data storage format, offering a fine-grained hierarchical cache mechanism in the unit of “Fiber” in memory. Even existing Parquet or ORC data files can be loaded using corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL, to allow users to define the customized indices based on relation. Currently, B+ tree and bloom filter are the first two types of indices supported. Last but not least, since Spinach resides in the process of Spark executor, there’s no extra effort in deployment. All you need to do is to pick Spinach from Spark packages when launching the Spark SQL.
sing corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL, to allow user to define the customized indices based on relation. Currently B+ tree and bloom filter are the first 2 types of index we’ve supported. Last but not least, since Spinach resides in the process of Spark executor, there’s no extra effort in deployment, all we need to do is to pick Spinach from Spark packages when launch the Spark SQL.
Spinach has been imported in Baidu’s production environment since Q4 2016. It helps several teams migrating their regular data analysis tasks from Hive or MR jobs to ad-hoc queries. In Baidu search ads system FengChao, data engineers analyze advertising effectiveness based on several TBs data of display and click logs every day. Spinach brings a 5x boost compared to original Spark SQL (version 2.1), especially in the scenario of complex search and large data volume. It optimizes the average search cost from minutes to seconds, while brings only 3% data size increase for adding a single index.
From R Script to Production Using rsparkling with Navdeep GillDatabricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraDatabricks
Data integration is a really difficult problem. We know this because 80% of the time in every project is spent getting the data you want the way you want it. We know this because this problem remains challenging despite 40 years of attempts to solve it. All we want is a service that will be reliable, handle all kinds of data and integrate with all kinds of systems, be easy to manage and scale as our systems grow. Oh, and it should be super low latency too. Is it too much to ask?
In this presentation, we’ll discuss the basic challenges of data integration and introduce few design and architecture patterns that are used to tackle these challenges. We will then explore how these patterns can be implemented using Apache Kafka. Difficult problems are difficult and we offer no silver bullets, but we will share pragmatic solutions that helped many organizations build fast, scalable and manageable data pipelines.
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. At this scale, output committers that create extra copies or can’t handle task failures are no longer practical. This talk will explain the problems that are caused by the available committers when writing to S3, and show how Netflix solved the committer problem.
In this session, you’ll learn:
– Some background about Spark at Netflix
– About output committers, and how both Spark and Hadoop handle failures
– How HDFS and S3 differ, and why HDFS committers don’t work well
– A new output committer that uses the S3 multi-part upload API
– How you can use this new committer in your Spark applications to avoid duplicating data
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks
At Databricks, we have a unique view into hundreds different companies using Apache Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common manageability, debugging, and visibility issues that our users run into. This talk will first show some representatives of these common issues. Then, we will show you what we have done and have been working on in Databricks to make Spark clusters easier to manage, monitor, and debug.
SSR: Structured Streaming for R and Machine Learningfelixcss
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Session hashtag: #SFdev2
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...Spark Summit
Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight leading to energy conservation and impacting the bottom line. We helped GridPocket (http://www.gridpocket.com/), a smart grid company developing energy management applications for electricity water and gas utilities, implement high scale anonymized energy comparison queries with an order of magnitude lower cost and higher performance than was previously possible. IoT use cases like that of GridPocket are swamping our planet with data, and drive demand for analytics on extremely scalable and low cost storage. Enter Spark SQL over Object Storage: highly scalable and low cost storage which provides RESTful APIs to store and retrieve objects and their metadata. Key performance indicators (KPIs) of query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of incurred REST requests. We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with an ability to dynamically filter irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, csv etc.), and unlike pushdown compatible formats such as Parquet which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket’s filter which screens objects according to their metadata, for example geo-spatial bounding boxes which describe the area covered by an object’s data points. This leads to drastically lower KPIs since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood. We demonstrate GridPocket analytics notebooks, report on our implementation and resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and how we applied this to other use cases.
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
Dr. Elephant helps improve Spark and Hadoop developer productivity and increase cluster efficiency by making clear recommendations on how to tune workloads and configurations. Originally developed by LinkedIn, Dr. Elephant is now in use at multiple sites.
This session will explore how Dr. Elephant works, the data it collects from Spark environments and the customizable heuristics that generate tuning recommendations. Learn how Dr. Elephant can be used to improve production cluster operations, help developers avoid common issues, and green light applications for use on production clusters.
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin Databricks
Machine learning has its challenges, and understanding the algorithms is not always easy. In this session, you’ll discover methods to make these challenges less daunting.
Intended for software engineers who need to understand the requirements and constraints of data scientists, and data scientists who need to implement or help implement production systems, the session will begin with a quick introduction to data quality and a level-set on common vocabulary. You’ll then explore the formats that are required by Spark ML to run its algorithms, and see how to automate the build through user-defined functions and other techniques. Automation will make reproducibility easy, minimize errors and increase the efficiency of data scientists.
Key takeaways will include:
– How to build the required tool set in Java
– Understanding the formats required by Spark ML (a new vocabulary)
– Learning fundamentals about data quality and how to make sure the data is usable
Analytics at Scale with Apache Spark on AWS with Jonathan FritzDatabricks
Organizations from small startups to large enterprises are rapidly adopting Apache Spark on Amazon EMR in Amazon Web Services (AWS) to run streaming analytics, data science, machine learning, and batch processing workloads. These customers can quickly create big data architectures within minutes, and decouple compute and storage with Amazon S3 as a highly scalable, durable, and secure data lake, lower costs using Amazon EC2 Spot Instances and Auto Scaling, and utilize a wide range of encryption and access control features. In this session, we discuss how customers are using Spark on AWS and common architectures for easily running performant Spark clusters at scale and low cost with Amazon EMR.
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
Data scientists write SQL queries everyday. Very often they know how to write correct queries but don’t know why their queries are slow. This is more obvious in Spark than in Redshift as Spark requires additional tuning such as caching while Redshift does heavy lifting behind the scene.
In this talk I will cover a few lessons we learned from migrating one of the biggest table here (900M+ rows/day) from AWS Redshift to Spark.
Specifically:
– Why and how do we migrate?
– How do we tune the query for Spark to gain 10x speed vs direct translated from Redshift
– How do we scale the team on Spark (with 80+ people in our data science team)
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Databricks
This session presents a simple, human-based approach to create test suites targeting multiple points of contact in a data solution. Commonly, an enterprise will pick a data processing solution with heavy GUIs because it can make an easy to understand workflow around data. However, those solutions still are not able to verify the simplest use case, i.e. “If I put data into a solution to process data, then I should get a desired result.”
FIS will demonstrate and teach you how to build a unique testing solution on top of Apache Spark. Under its solution, FIS can actually prove to users in their organization that when they put data in, they get the correct result out. They can also enlist their entire team from product owner to developer to write complete unit tests. The type of flexibility Spark enables allows you to take unique paths in building robust, understandable data flows. The transformational element is the ability to do this in milliseconds, and not wait till the entire pipeline finishes.
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Databricks
Almost all organizations now have a need for data science and, as such, the main challenge after determining the algorithm is to scale it up and make it operational. Comcast uses several tools and technologies such as Python, R, SaS, H2O and so on. In this session, they’ll show how many common use cases use the common algorithms like Logistic Regression, Random Forest, Decision Trees, Clustering, NLP, etc.
Apache Spark has several machine learning algorithms built in and has excellent scalability. Hence, at Comcast, they built a platform to provide DSaaS on top of Spark with REST API as a means of controlling and submitting jobs, so as to abstract most users from the rigor of writing (repeating) code, instead focusing on the actual requirements.
Learn how they solved some of the problems of establishing feature vectors, choosing algorithms and then deploying models into production. They’ll also showcase their use of Scala, R and Python to implement models using language of choice yet deploying quickly into production on 500-node Spark clusters.
Introduction to Streaming Distributed Processing with StormBrandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introducing streaming data concepts, Storm cluster architecture, Storm topology architecture, and demonstrate working example of a WordCount topology for SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2nwSwEh.
Marco Bonzanini discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data; in general, all the steps necessary to prepare data for a data-driven product. In particular, he focuses on data plumbing and on the practice of going from prototype to production. Filmed at qconlondon.com.
Marco Bonzanini is Data Scientist and co-organizer of PyData London Meetup.
Bullet is an open sourced, lightweight, pluggable querying system for streaming data without a persistence layer implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system.
Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.
An instance of Bullet is currently running at Yahoo against its user engagement data pipeline. We’ll highlight how it is powering internal use-cases such as web page and native app instrumentation validation. Finally, we’ll show a demo of Bullet and go over query performance numbers.
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
Talking about the ease of use and handling Big Data technologies in the Cloud. Using Google Cloud Platform and Amazon Web Services and all of the tools around it.
Showing the problems and how we can solve them with simple tools.
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks
At Databricks, we have a unique view into hundreds different companies using Apache Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common manageability, debugging, and visibility issues that our users run into. This talk will first show some representatives of these common issues. Then, we will show you what we have done and have been working on in Databricks to make Spark clusters easier to manage, monitor, and debug.
SSR: Structured Streaming for R and Machine Learningfelixcss
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Session hashtag: #SFdev2
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...Spark Summit
Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight leading to energy conservation and impacting the bottom line. We helped GridPocket (http://www.gridpocket.com/), a smart grid company developing energy management applications for electricity water and gas utilities, implement high scale anonymized energy comparison queries with an order of magnitude lower cost and higher performance than was previously possible. IoT use cases like that of GridPocket are swamping our planet with data, and drive demand for analytics on extremely scalable and low cost storage. Enter Spark SQL over Object Storage: highly scalable and low cost storage which provides RESTful APIs to store and retrieve objects and their metadata. Key performance indicators (KPIs) of query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of incurred REST requests. We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with an ability to dynamically filter irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, csv etc.), and unlike pushdown compatible formats such as Parquet which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket’s filter which screens objects according to their metadata, for example geo-spatial bounding boxes which describe the area covered by an object’s data points. This leads to drastically lower KPIs since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood. We demonstrate GridPocket analytics notebooks, report on our implementation and resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and how we applied this to other use cases.
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
Dr. Elephant helps improve Spark and Hadoop developer productivity and increase cluster efficiency by making clear recommendations on how to tune workloads and configurations. Originally developed by LinkedIn, Dr. Elephant is now in use at multiple sites.
This session will explore how Dr. Elephant works, the data it collects from Spark environments and the customizable heuristics that generate tuning recommendations. Learn how Dr. Elephant can be used to improve production cluster operations, help developers avoid common issues, and green light applications for use on production clusters.
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin Databricks
Machine learning has its challenges, and understanding the algorithms is not always easy. In this session, you’ll discover methods to make these challenges less daunting.
Intended for software engineers who need to understand the requirements and constraints of data scientists, and data scientists who need to implement or help implement production systems, the session will begin with a quick introduction to data quality and a level-set on common vocabulary. You’ll then explore the formats that are required by Spark ML to run its algorithms, and see how to automate the build through user-defined functions and other techniques. Automation will make reproducibility easy, minimize errors and increase the efficiency of data scientists.
Key takeaways will include:
– How to build the required tool set in Java
– Understanding the formats required by Spark ML (a new vocabulary)
– Learning fundamentals about data quality and how to make sure the data is usable
Analytics at Scale with Apache Spark on AWS with Jonathan FritzDatabricks
Organizations from small startups to large enterprises are rapidly adopting Apache Spark on Amazon EMR in Amazon Web Services (AWS) to run streaming analytics, data science, machine learning, and batch processing workloads. These customers can quickly create big data architectures within minutes, and decouple compute and storage with Amazon S3 as a highly scalable, durable, and secure data lake, lower costs using Amazon EC2 Spot Instances and Auto Scaling, and utilize a wide range of encryption and access control features. In this session, we discuss how customers are using Spark on AWS and common architectures for easily running performant Spark clusters at scale and low cost with Amazon EMR.
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
Data scientists write SQL queries everyday. Very often they know how to write correct queries but don’t know why their queries are slow. This is more obvious in Spark than in Redshift as Spark requires additional tuning such as caching while Redshift does heavy lifting behind the scene.
In this talk I will cover a few lessons we learned from migrating one of the biggest table here (900M+ rows/day) from AWS Redshift to Spark.
Specifically:
– Why and how do we migrate?
– How do we tune the query for Spark to gain 10x speed vs direct translated from Redshift
– How do we scale the team on Spark (with 80+ people in our data science team)
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Databricks
This session presents a simple, human-based approach to create test suites targeting multiple points of contact in a data solution. Commonly, an enterprise will pick a data processing solution with heavy GUIs because it can make an easy to understand workflow around data. However, those solutions still are not able to verify the simplest use case, i.e. “If I put data into a solution to process data, then I should get a desired result.”
FIS will demonstrate and teach you how to build a unique testing solution on top of Apache Spark. Under its solution, FIS can actually prove to users in their organization that when they put data in, they get the correct result out. They can also enlist their entire team from product owner to developer to write complete unit tests. The type of flexibility Spark enables allows you to take unique paths in building robust, understandable data flows. The transformational element is the ability to do this in milliseconds, and not wait till the entire pipeline finishes.
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Databricks
Almost all organizations now have a need for data science and, as such, the main challenge after determining the algorithm is to scale it up and make it operational. Comcast uses several tools and technologies such as Python, R, SaS, H2O and so on. In this session, they’ll show how many common use cases use the common algorithms like Logistic Regression, Random Forest, Decision Trees, Clustering, NLP, etc.
Apache Spark has several machine learning algorithms built in and has excellent scalability. Hence, at Comcast, they built a platform to provide DSaaS on top of Spark with REST API as a means of controlling and submitting jobs, so as to abstract most users from the rigor of writing (repeating) code, instead focusing on the actual requirements.
Learn how they solved some of the problems of establishing feature vectors, choosing algorithms and then deploying models into production. They’ll also showcase their use of Scala, R and Python to implement models using language of choice yet deploying quickly into production on 500-node Spark clusters.
Introduction to Streaming Distributed Processing with StormBrandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introducing streaming data concepts, Storm cluster architecture, Storm topology architecture, and demonstrate working example of a WordCount topology for SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2nwSwEh.
Marco Bonzanini discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data; in general, all the steps necessary to prepare data for a data-driven product. In particular, he focuses on data plumbing and on the practice of going from prototype to production. Filmed at qconlondon.com.
Marco Bonzanini is Data Scientist and co-organizer of PyData London Meetup.
Bullet is an open sourced, lightweight, pluggable querying system for streaming data without a persistence layer implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system.
Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.
An instance of Bullet is currently running at Yahoo against its user engagement data pipeline. We’ll highlight how it is powering internal use-cases such as web page and native app instrumentation validation. Finally, we’ll show a demo of Bullet and go over query performance numbers.
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
Talking about the ease of use and handling Big Data technologies in the Cloud. Using Google Cloud Platform and Amazon Web Services and all of the tools around it.
Showing the problems and how we can solve them with simple tools.
From a student to an apache committer practice of apache io tdbjixuan1989
This talk is introduce by Xiangdong Huang, who is a PPMC of Apache IoTDB (incubating) project, at Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays more and more important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source software and communities to the world.
The invited guests of this lecture are all from ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to Apache annual report), the first Committer in the Hadoop project in China, several Apache project mentors or VPs, and many Apache Committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
Creating a Data Science Team from an Architect's perspective. This is about team building on how to support a data science team with the right staff, including data engineers and devops.
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, through probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study about the technologies, the processes, and the people involved.
From Lab to Factory: Or how to turn data into valuePeadar Coyle
We've all heard of 'big data' or data science, but how do we convert these trends into actual business value. I share case studies, and technology tips and talk about the challenges of the data science process. This is all based on two years of in-the-field research of deploying models, and going from prototypes to production.
These are slides from my talk at PyCon Ireland 2015
Making the Most of In-Memory: More than SpeedInside Analysis
The Briefing Room with Robin Bloor and Kognitio
Live Webcast Oct. 1, 2013
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7539482&rKey=bc304aa8dac7b781
Everyone’s talking about in-memory these days, and the term has become synonymous with speed. But pinning data into memory is just the beginning, and it’s about more than speed. In-memory solutions need a tailored architecture, one that can take full advantage RAM processing from every aspect, and this requires an approach that considers memory and CPU from the ground-up.
Register for this episode of The Briefing Room to hear from veteran Analyst Robin Bloor as he explains how memory is on the fast track to supersede disk, at least with respect to advanced analytics. He’ll be briefed by Kognitio CTO Roger Gaskell, who pioneered the in-memory analytical platform since its inception in 1989. He will also discuss how this type of solution changes the landscape for the modern data architecture and its impact on advanced analytical capabilities.
Visit InsideAnalysis.com for more information
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
In this talk I will introduce you to a Docker container that provides you an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Speakers
Machine learning applications are typically stitched together from hopes and dreams, shell scripts, cron jobs, home-grown schedulers, snippets of configuration clipped from multiple blog posts, thousands of hard-coded business rules, a.k.a. "our SQL corpus," and a few lines of training and testing code. Organizing all the moving parts into something maintainable and supportive of ongoing development is a challenge most teams have on their TODO list, roadmap, or tech debt pile. Getting ahead of the day-to-day demands and settling into a sane architecture often seems like an unattainable goal. The past several years have seen an explosion of tool-building in the data engineering and analytics area, including in Apache projects spanning the areas of search and information retrieval, job orchestration, file and stream formats, and machine learning libraries. In this talk we will cover our product and development teams' choices of architecture and tools, from data ingestion and storage, through transformations and processing, to presentation of results and publishing to web services, reports, and applications.
Architecting a Platform for Enterprise Use - Strata London 2018mark madsen
The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This session will discuss hidden design assumptions, review design principles to apply when building multi-use data infrastructure, and provide a reference architecture to use as you work to unify your analytics infrastructure.
The focus in our market has been on acquiring technology, and that ignores the more important part: the larger IT landscape within which this technology lives and the data architecture that lies at its core. If one expects longevity from a platform then it should be a designed rather than accidental architecture.
Architecture is more than just software. It starts from use and includes the data, technology, methods of building and maintaining, and organization of people. What are the design principles that lead to good design and a functional data architecture? What are the assumptions that limit older approaches? How can one integrate with, migrate from or modernize an existing data environment? How will this affect an organization's data management practices? This tutorial will help you answer these questions.
Topics covered:
* A brief history of data infrastructure and past design assumptions
* Categories of data and data use in organizations
* Analytic workload characteristics and constraints
* Data architecture
* Functional architecture
* Tradeoffs between different classes of technology
* Technology planning assumptions and guidance
#strataconf
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top a table; We load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
10. … Virtualized Everything
Lightning Review of Data Platforms
1. Assign Tasks their own virtual hardware
2. Expend / Contract Resources by demand
3. Real-time HotSwapping
4. Software Updates Built In
5. Etc Etc Etc.
11. idiosyncratic Organizing Principles
iOP1) Clarity
iOP2) Engineers are not Data Scientists
iOP3) PMs are not Data Scientists
iOP4) Data Scientists are not Engineers
iOP5) Close the Data Loop
12. iOP1: Data Clarity
PUBLISH THIS INTERNALLY!
“big data, big noise”
Where is what data?
Who owns what field?
What is this this field?
Where did this field go?
Why is this field NULL?
13. iOP1: Data Clarity
Minibatch streaming into nested JSON?
O(10kB)?
GZip?
O(50-500 MB)
Parquet.
Snappy.
“Expect Data Science”
Spark
Big Thanks to Jason Pohl @ DB!
And Charles Pritchard!
14. iOP2: Engineers are not Data Scientists
“why would you need to do that?”
Scratch Space
Cluster Bootstrap Permissions
Access S3 Buckets
Sandbox Clusters
Share Notebooks Across Accounts
We DO NOT SPEAK IAM Role/Anything
15. iOP2: Engineers are not Data Scientists
“why would you need to do that?”
if possible:
Write your own Pipelines.
else:
Explain Data Science.
16. iOP3: PMs are not Data Scientists
“you don’t need that!”
Once upon a time in the Flight DataLake…
Only 10% of a Search Impression was recorded
Worse: It was only the Cheapest10%
Many of the bookings where not included in this list!
17. iOP4: Data Scientists are Not Engineers
“we need to support models in @#&%? format”
Pick a Robust Standard and Stick to It.
If you’re big enough to worry about this, you can commit code
jPMML
Everybody Use Git. Now. Yes You.
Production Code Matters. Format. Document.
Pipelines Count as Production Code.
18. iOP5: Close that Data Loop
What is your data doing?
New Data? Consider Bandits!
Big Data? Set up a learning problem.
Empower by Design.