PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how Honest Company has used Spark as a workhorse for 1) collecting, ETL and storing data from various sources including mysql, mongo, jde, Google analytics, Facebook, Localytics and REST API; 2) building data models and aggregating and generating reports of revenue, order fulfillment tracking, data pipeline monitoring and subscriptions; 3) Using ML to build model for user acquisitions, LTV and recommendations use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company to focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations to improve experience, data users at Honest understood users much better in terms of segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
Around the world, businesses are turning to AI to transform the way they operate and serve their customers. But before they can implement these technologies, companies must address the roadblock of moving from batch analytics to making real-time decisions by rapidly accessing and analyzing the relevant information amidst a sea of data. Yaron will explain how to make Spark handle multivariate real-time, historical and event data simultaneously to provide immediate and intelligent responses. He will present several time sensitive use-cases including fraud detection, prevention of outages and customer recommendations to demonstrate how to perform predictive analytics and real-time actions with Spark.
Speaker: Yaron Ekshtein
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
People, Platform, Projects: these slides overview how Netflix works with Big Data. I share how our teams are organized, the roles we typically have on the teams, an overview of our Big Data Platform, and two example projects.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on github, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how Honest Company has used Spark as a workhorse for 1) collecting, ETL and storing data from various sources including mysql, mongo, jde, Google analytics, Facebook, Localytics and REST API; 2) building data models and aggregating and generating reports of revenue, order fulfillment tracking, data pipeline monitoring and subscriptions; 3) Using ML to build model for user acquisitions, LTV and recommendations use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company to focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations to improve experience, data users at Honest understood users much better in terms of segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
Around the world, businesses are turning to AI to transform the way they operate and serve their customers. But before they can implement these technologies, companies must address the roadblock of moving from batch analytics to making real-time decisions by rapidly accessing and analyzing the relevant information amidst a sea of data. Yaron will explain how to make Spark handle multivariate real-time, historical and event data simultaneously to provide immediate and intelligent responses. He will present several time sensitive use-cases including fraud detection, prevention of outages and customer recommendations to demonstrate how to perform predictive analytics and real-time actions with Spark.
Speaker: Yaron Ekshtein
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
People, Platform, Projects: these slides overview how Netflix works with Big Data. I share how our teams are organized, the roles we typically have on the teams, an overview of our Big Data Platform, and two example projects.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on github, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Databricks
Quby, an Amsterdam-based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI powered products that are used by hundreds of thousands of users on a daily basis. Delta Lake ensures the quality of incoming records though schema enforcement and evolution. But it is the Data Engineers role to check whether the expected data is ingested in to the Delta Lake at the right time with expected metrics so that downstream processes will function their duties. Re-training models and serving on the fly might go wrong unless we put the right monitoring infrastructure too.
Realtime streaming architecture in INFINARIOJozo Kovac
About our experience with realtime analyses on never-ending stream of user events. Discuss Lambda architecture, Kappa, Apache Kafka and our own approach.
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
In healthcare, DICOM is an international standard format for storing medical images (MRI/CT representations). Each image has associated with it embedded metadata and pixel data. There is currently a tremendous amount of effort in healthcare to incorporate image analytics within clinical data analysis. Apache Spark is a natural framework to integrate these efforts.
This session presents an analytics workflow using Apache Spark to perform ETL on DICOM images, and then to perform Eigen decomposition to derive meaningful insights on the pixel data. The workflow integrates a Java based framework DCM4CHE with Apache Spark to parallelize the big data workload for fast processing. Users can extract features based on the metadata and run efficient clean/filter/drill-down for preprocessing. See a demonstration of predictive analytics with visualization using the metadata to derive insights, such as likelihood of a condition or efficacy of medication administered.
The speakers will also present performance benchmarks of this workflow on various datasets and cluster configurations to demonstrate the benefits of running this kind of analysis workflow on Apache Spark.
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Databricks
Deploying machine learning models seems like it should be a relatively easy task. Take your model and pass it some features in production. The reality is that the code written during the prototyping phase of model development doesn’t always work when applied at scale or on “real” data. This talk will explore 1) common problems at the intersection of data science and data engineering 2) how you can structure your code so there is minimal friction between prototyping and production, and 3) how you can use Apache Spark to run predictions on your models in batch or streaming contexts.
You will take away how to address some of productionizing issues that data scientists and data engineers face while deploying machine learning models at scale and a better understanding of how to work collaboratively to minimize disparity between prototyping and productizing.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides data provenance capability, which can help understand how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a realtime code fix feature, and selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only affected data and code.
The BigDebug system should contribute to improving Spark developerproductivity and the correctness of their Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities namely such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence it can offer a basic recommender for cross-product applications where we may not have a dedicate recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles and hence we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it and how to process the raw data of pairwise similarity function into final clusters. Lessons learned from this talk can help all sort of companies where they want to integrate their data or deduplicate their user/customer/product databases.
Saving Energy in Homes with a Unified Approach to Data and AIDatabricks
Energy wastage by residential buildings is a significant contributor to total worldwide energy consumption. Quby, an Amsterdam based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage.
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)Albert Wong
Building a data platform doesn’t have to be like entering a portal to Stranger Things.
Join us in one hour for Tableau in the Cloud: A Netflix Original where Albert Wong, Netflix’s analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale.
Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class. WHAT
YOU’LL LEARN:
– Understand the key data reliability and performance data pipelines challenges
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment – How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speakers: Steven Yu, Burak Yavuz
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Data Con LA
This talk will address one apparently simple question: which open source machine learning tools one can use out-of-the-box to do binary classification (probably the most common machine learning problem) on largish datasets. We will discuss therefore the speed, scalability and accuracy of the top (most accurate) learning algorithms in R packages, Python’s scikit-learn, H2O, xgboost and Spark’s MLlib.
Zeppelin at Twitter - Prasad Wagle, Technical Lead in the Data Platform team - Twitter
Prasad will talk about how Zeppelin is used at Twitter, the development work they did before release and the features and enhancements they are working on to increase adoption.
Building the Artificially Intelligent EnterpriseDatabricks
This session looks at where we are today with data and analytics and what is needed to transition to the Artificially Intelligent Enterprise.
How do you mobilise developers to exploit what data scientists and business analysts have built? How do you align it all with business strategy to maximise business outcomes? How do you combine BI, predictive and prescriptive analytics, automation and reinforcement learning to get maximum value across the enterprise? What is the blueprint for building the artificially intelligent enterprise?
•Data and analytics – Where are we?
•Why is the journey only half-way done?
•2021 and beyond – The new era of AI usage and not just build
•The requirement – event-driven, on-demand and automated analytics
•Operationalising what you build – DataOps, MLOps and RPA
•Mobilising the masses to integrate AI into processes – what needs to be done?
•Business strategy alignment – the guiding light to AI utilisation for high reward
•Agility step change – the shift to no-code integration of AI by citizen developers
•Recording decisions, and analysing business impact
•Reinforcement-learning – transitioning to continuous reward
DevOps and Machine Learning (Geekwire Cloud Tech Summit)Jasjeet Thind
DevOps and Machine Learning: How do you test and deploy real-time machine learning services given the challenge that machine learning algorithms produce nondeterministic behaviors even for the same input.
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Rehgan Avon
2018 Women in Analytics Conference
https://www.womeninanalytics.org/
Over the last year I’ve become obsessed with learning how to be a better "cloud computing evangelist to data scientists" - specifically to the R community. I’ve learned that this isn’t often an easy undertaking. Most people (data scientists or not) are skeptical of changing up the tools and workflows they’ve come to rely on when those systems seem to be working. Resistance to change increases even further with barriers to quick adoption, such as having to teach yourself a completely new technology or framework. I’d like to give a talk about how working in the cloud changes data science and how exploring these tools can lead to a world of new possibilities within the intersection of DevOps and Data Analytics.
Topics to discuss:
- Working through functionality/engineering challenges with R in a cloud environment
- Opportunities to customize and craft your ideal version of R/RStudio
- Making and embracing a decision on what is “real" about your analysis or daily work (Chapter 6 in R for Data Science)
- Running multiple R instances in the cloud (why would you want to do this?)
- Becoming an R/Data Science Collaboration wizard: Building APIs with Plumber in the Cloud
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Databricks
Quby, an Amsterdam-based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI powered products that are used by hundreds of thousands of users on a daily basis. Delta Lake ensures the quality of incoming records though schema enforcement and evolution. But it is the Data Engineers role to check whether the expected data is ingested in to the Delta Lake at the right time with expected metrics so that downstream processes will function their duties. Re-training models and serving on the fly might go wrong unless we put the right monitoring infrastructure too.
Realtime streaming architecture in INFINARIOJozo Kovac
About our experience with realtime analyses on never-ending stream of user events. Discuss Lambda architecture, Kappa, Apache Kafka and our own approach.
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
In healthcare, DICOM is an international standard format for storing medical images (MRI/CT representations). Each image has associated with it embedded metadata and pixel data. There is currently a tremendous amount of effort in healthcare to incorporate image analytics within clinical data analysis. Apache Spark is a natural framework to integrate these efforts.
This session presents an analytics workflow using Apache Spark to perform ETL on DICOM images, and then to perform Eigen decomposition to derive meaningful insights on the pixel data. The workflow integrates a Java based framework DCM4CHE with Apache Spark to parallelize the big data workload for fast processing. Users can extract features based on the metadata and run efficient clean/filter/drill-down for preprocessing. See a demonstration of predictive analytics with visualization using the metadata to derive insights, such as likelihood of a condition or efficacy of medication administered.
The speakers will also present performance benchmarks of this workflow on various datasets and cluster configurations to demonstrate the benefits of running this kind of analysis workflow on Apache Spark.
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Databricks
Deploying machine learning models seems like it should be a relatively easy task. Take your model and pass it some features in production. The reality is that the code written during the prototyping phase of model development doesn’t always work when applied at scale or on “real” data. This talk will explore 1) common problems at the intersection of data science and data engineering 2) how you can structure your code so there is minimal friction between prototyping and production, and 3) how you can use Apache Spark to run predictions on your models in batch or streaming contexts.
You will take away how to address some of productionizing issues that data scientists and data engineers face while deploying machine learning models at scale and a better understanding of how to work collaboratively to minimize disparity between prototyping and productizing.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides data provenance capability, which can help understand how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a realtime code fix feature, and selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only affected data and code.
The BigDebug system should contribute to improving Spark developerproductivity and the correctness of their Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities namely such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence it can offer a basic recommender for cross-product applications where we may not have a dedicate recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles and hence we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it and how to process the raw data of pairwise similarity function into final clusters. Lessons learned from this talk can help all sort of companies where they want to integrate their data or deduplicate their user/customer/product databases.
Saving Energy in Homes with a Unified Approach to Data and AIDatabricks
Energy wastage by residential buildings is a significant contributor to total worldwide energy consumption. Quby, an Amsterdam based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage.
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)Albert Wong
Building a data platform doesn’t have to be like entering a portal to Stranger Things.
Join us in one hour for Tableau in the Cloud: A Netflix Original where Albert Wong, Netflix’s analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale.
Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class. WHAT
YOU’LL LEARN:
– Understand the key data reliability and performance data pipelines challenges
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment – How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speakers: Steven Yu, Burak Yavuz
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Data Con LA
This talk will address one apparently simple question: which open source machine learning tools one can use out-of-the-box to do binary classification (probably the most common machine learning problem) on largish datasets. We will discuss therefore the speed, scalability and accuracy of the top (most accurate) learning algorithms in R packages, Python’s scikit-learn, H2O, xgboost and Spark’s MLlib.
Zeppelin at Twitter - Prasad Wagle, Technical Lead in the Data Platform team - Twitter
Prasad will talk about how Zeppelin is used at Twitter, the development work they did before release and the features and enhancements they are working on to increase adoption.
Building the Artificially Intelligent EnterpriseDatabricks
This session looks at where we are today with data and analytics and what is needed to transition to the Artificially Intelligent Enterprise.
How do you mobilise developers to exploit what data scientists and business analysts have built? How do you align it all with business strategy to maximise business outcomes? How do you combine BI, predictive and prescriptive analytics, automation and reinforcement learning to get maximum value across the enterprise? What is the blueprint for building the artificially intelligent enterprise?
•Data and analytics – Where are we?
•Why is the journey only half-way done?
•2021 and beyond – The new era of AI usage and not just build
•The requirement – event-driven, on-demand and automated analytics
•Operationalising what you build – DataOps, MLOps and RPA
•Mobilising the masses to integrate AI into processes – what needs to be done?
•Business strategy alignment – the guiding light to AI utilisation for high reward
•Agility step change – the shift to no-code integration of AI by citizen developers
•Recording decisions, and analysing business impact
•Reinforcement-learning – transitioning to continuous reward
DevOps and Machine Learning (Geekwire Cloud Tech Summit)Jasjeet Thind
DevOps and Machine Learning: How do you test and deploy real-time machine learning services given the challenge that machine learning algorithms produce nondeterministic behaviors even for the same input.
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Rehgan Avon
2018 Women in Analytics Conference
https://www.womeninanalytics.org/
Over the last year I’ve become obsessed with learning how to be a better "cloud computing evangelist to data scientists" - specifically to the R community. I’ve learned that this isn’t often an easy undertaking. Most people (data scientists or not) are skeptical of changing up the tools and workflows they’ve come to rely on when those systems seem to be working. Resistance to change increases even further with barriers to quick adoption, such as having to teach yourself a completely new technology or framework. I’d like to give a talk about how working in the cloud changes data science and how exploring these tools can lead to a world of new possibilities within the intersection of DevOps and Data Analytics.
Topics to discuss:
- Working through functionality/engineering challenges with R in a cloud environment
- Opportunities to customize and craft your ideal version of R/RStudio
- Making and embracing a decision on what is “real" about your analysis or daily work (Chapter 6 in R for Data Science)
- Running multiple R instances in the cloud (why would you want to do this?)
- Becoming an R/Data Science Collaboration wizard: Building APIs with Plumber in the Cloud
Codemotion Milan 2018 - AI with a devops mindset: experimentation, sharing an...Thiago de Faria
AI is the buzzword while ML is the underlying component... but when do we use ML? To solve problems that machines can find patterns without explicitly programming them to do so. But do you have a team building an ML model? How far are they from the IT team? Do they know how to deploy and serve that? Testing? And sharing what they have done? That's where a devops mindset comes in: reduce the batch size, continuous-everything and a culture of failure/experimentation are vital for your data team! In the end, I will show how the workflow of a data scientist can be in real life with a live demo!
Thiago de Faria - AI with a devops mindset - experimentation, sharing and eas...Codemotion
AI is the buzzword while ML is the underlying component... but when do we use ML? To solve problems that machines can find patterns without explicitly programming them to do so. But do you have a team building an ML model? How far are they from the IT team? Do they know how to deploy and serve that? Testing? And sharing what they have done? That's where a devops mindset comes in: reduce the batch size, continuous-everything and a culture of failure/experimentation are vital for your data team! In the end, I will show how the workflow of a data scientist can be on the real life with a live demo!
Creating a Data Science Team from an Architect's perspective. This is about team building on how to support a data science team with the right staff, including data engineers and devops.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
In this talk I will introduce you to a Docker container that provides you an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Speakers
A changing market landscape and open source innovations are having a dramatic impact on the consumability and ease of use of data science tools. Join this session to learn about the impact these trends and changes will have on the future of data science. If you are a data scientist, or if your organization relies on cutting edge analytics, you won't want to miss this!
A series of tweets I posted about my 11hr struggle to make a cup of tea with my WiFi kettle ended-up going viral, got picked-up by the national and then international press, and led to thousands of retweets, comments and references in the media. In this session we’ll take the data I recorded on this Twitter activity over the period and use Oracle Big Data Graph and Spatial to understand what caused the breakout and the tweet going viral, who were the key influencers and connectors, and how the tweet spread over time and over geography from my original series of posts in Hove, England.
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...apidays
apidays LIVE Australia 2021 - Accelerating Digital
September 15 & 16, 2021
Tracing across your distributed process boundaries using OpenTelemetry
Dasith Wijes, Senior Consultant at Microsoft (Azure Cloud & AI Team)
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
Many organisations are creating groups dedicated to data. These groups have many names : Data Team, Data Labs, Analytics Teams….
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and their ability to actually deploy data science applications in production.
In that regards a new role of “DataOps” is emerging. Similar, to Dev Ops for (Web) Dev, the Data Ops is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a data ops would have also a perspective on the quality of data quality and the relevance of predictive models.
Do you want to be a Data Ops ? We’ll discuss its role and challenges during this talk
Data scientists utilize a variety of tools and techniques to obtain insights from data. In this session, we discuss where and how Splunk fits into the data scientist's tool belt. We highlight Splunk’s built-in statistical capabilities and integrate external statistical and graphical tools to showcase data preparation, predictive modeling and visualization.
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
This Data Science with Python presentation will help you understand what is Data Science, basics of Python for data analysis, why learn Python, how to install Python, Python libraries for data analysis, exploratory analysis using Pandas, introduction to series and dataframe, loan prediction problem, data wrangling using Pandas, building a predictive model using Scikit-Learn and implementing logistic regression model using Python. The aim of this video is to provide a comprehensive knowledge to beginners who are new to Python for data analysis. This video provides a comprehensive overview of basic concepts that you need to learn to use Python for data analysis. Now, let us understand how Python is used in Data Science for data analysis.
This Data Science with Python presentation will cover the following topics:
1. What is Data Science?
2. Basics of Python for data analysis
- Why learn Python?
- How to install Python?
3. Python libraries for data analysis
4. Exploratory analysis using Pandas
- Introduction to series and dataframe
- Loan prediction problem
5. Data wrangling using Pandas
6. Building a predictive model using Scikit-learn
- Logistic regression
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you'll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor has ranked data scientist first in the 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with python certification training course. With Simplilearn Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques.
Learn more at: https://www.simplilearn.com
Similar to Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with David Taieb (20)
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top a table; We load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with David Taieb
1. @DTAIEB55
Taking Jupyter Notebooks and
Apache Spark to the next level with
PixieDust
David Taieb
Distinguished Engineer
IBM Watson Data Platform, Developer Advocacy
@DTAIEB55
2. @DTAIEB55
• More companies making bet-the-business data driven decisions
• Good news: they are drowning in Data
• Bad news: they are drowning in Data
• Solving the Data problems of tomorrow cannot be done by data
scientists alone.
• Developers are getting more involved with Data Science, moving
from stovepipe applications to “data pipelines” that integrate data
and analytics.
WHY ARE YOU HERE?
3. @DTAIEB55
Disclaimer: All characters and events depicted in this story are entirely
fictitious. Any similarity to actual use cases, events or persons is actually
intentional.
Let’s start with a story… we all know too well.
How do we blur the lines between
developers and data scientists?
4. @DTAIEB55
• Hold a master degree in computer science
• 10 year experience, 6 years with the company
• Languages of choice: Java, Node.js, HTML5/CSS3
• Data: No SQL (Cloudant, Mongo), relational
• No major experience with Big Data
T H E D E V E L O P E R
“The best line of code is the one I didn't have to write!”
MEET BEN
5. @DTAIEB55
• Hold a PHD in data science
• 5 year experience, 2 years with the company
• Experienced in Python and R
• Expert in Machine Learning and Data visualization
• Software engineering is not her thing
T H E D A T A S C I E N T I S T
“In God we trust. All others bring data.”
— W. Edwards Deming
MEET NATASHA
6. @DTAIEB55
SURPRISE MEETING
With the VP of Development
“We have an urgent need for our marketing department
to build an application that can provide real-time sentiment
analysis on Twitter data.”
https://unsplash.com/search/meeting?photo=3fPXt37X6UQ
7. @DTAIEB55
• You only have 6 weeks to build the application
• Target consumer is the business-focused user
• Must be easy to use even for non technical people
• It must scale out of the box
• I want you to look at Apache Spark
KEY CONSTRAINTS
9. @DTAIEB55
Great Question Natasha!
B e s t w a y t o a n s w e r i t i s t o a r r a n g e a t i c k e t t o t h e
S p a r k S u m m i t f o r y o u t o f i n d o u t
13. @DTAIEB55
WHAT IS A NOTEBOOK?
• Web based UI for
running Apache
Spark console
commands
• Easy, no install spark
accelerator
• Best way to start
working with spark
• Multiple flavors
• Jupyter
• Zeppelin
• Local or cloud hosted
• IBM Data Science
Experience
• Databricks
Text
Annotations
Code
Data
Visualizations
Widgets
Output
15. @DTAIEB55
BIG DATA ANALYSIS
code
results
Kernel with Spark
support
Services:
Congitive, …
Libraries:
Statistics, Math,
Machine Learning,
Plotting,
Data
(flat files, relational database,
NoSQL database, …)
Worker
Worker
...
Worker
16. @DTAIEB55
— BEN
“But they seem complicated for
developers like me”
NOTEBOOKS ARE POWERFUL TOOLS FOR DATA SCIENTISTS
17. @DTAIEB55
• Visualize data (e.g., Table, Charts, Map, etc)
• Data Management with PixieApps
• Download/export data (e.g., File, Cloudant, etc.)
• Use Scala directly in a Python notebook
• Install Spark packages into Python notebook
• Spark job progress monitor
• Extensible
Open Source Python helper library for Jupyter Notebooks
ENTER PIXIEDUST
https://github.com/ibm-cds-labs/pixiedust
18. @DTAIEB55
DEMO: PIXIEDUST DATA
VISUALIZATION
Easy Visualizations
https://github.com/ibm-cds-labs/pixiedust/blob/master/notebook/PixieDust%201%20-%20Easy%20Visualizations.ipynb
Working with External Data
https://github.com/ibm-cds-labs/pixiedust/blob/master/notebook/PixieDust%202%20-%20Working%20with%20External%20Data.ipynb
22. @DTAIEB55
Enter PixieApps
• PixieApps are Python classes used to write UI for your analytics that runs directly in a Jupyter
Notebook
• Easy to build: mostly HTML and CSS with some custom attributes (micro-format style)
• Leverage PixieDust Display visualization for charting
• With PixieApps you can:
• Create different html views with routes to invoke them
• Invoke Python Scripts from user interactions
• Run in the notebook cell output or in a Dialog
• and much more…
• Use cases:
• Dashboards
• Data Browsers
• Data Pipeline Management
26. @DTAIEB55
— BEN
“Do I need to learn yet another framework?”
WHAT DOES IT TAKE TO BUILD A PIXIEAPP?
27. @DTAIEB55
PIXIEAPP HELLO WORLD
Import app package to start things off
Simple annotation to tell PixieDust it’s an app
Define the default route (no args).
Method will return the view’s html fragment
Html fragment for the view.
Allows Jinja2 template macros
Define a new route that triggers when option
clicked is set to true
set option clicked to true when button is
pushed, so that correct route is loaded next
Import app package to start things off
28. @DTAIEB55
PIXIEAPP HELLO WORLD WITH DATA
Placeholder div for displaying data
Display the output in the specified target
Entity binding: use the df passed by user
Allows binding of any entity created by the app
Specify Display options for visualization
Pass data to the app
30. @DTAIEB55
• I’ll work on data acquisition from Twitter
and enrichment with sentiment analysis
scores using Spark Streaming
• I know Java very well, but I don’t have
time to learn Python.
• However, I am willing to learn Scala if
that helps improve my productivity
S T A R T B R A I N S T O R M I N G
• I’ll perform the data exploration and
analysis
• I know Python and R, but I am not
familiar enough with Java or Scala
• I like pandas and numpy. I’m ok to learn
Spark but expect the same level of apis
• I need to work iteratively with the data
I’ll need to do some data exploration too. I’ll need APIs to access my data.
BEN and NATASHA
31. @DTAIEB55
• Implement a Spark Streaming
connector to Twitter
• Call Watson Tone Analyzer for each
tweets
• Return a Spark DataFrame with the
tweets enriched with Tone scores
• Code written in Scala, delivered as a
Jar
D I V I D I N G T H E T A S K S
BEN and NATASHA
• Works in a Python Notebook
• Using PixieDust PackageManager,
install the Scala library delivered by
Ben to load the twitter data with Tone
scores
• Using PixieDust display() api, perform
the data exploration and analysis:
trending hashtags and sentiments
• Produce visualizations to LOB Users
35. @DTAIEB55
What’s next for PixieDust
• Support Visualization for Streaming data
– Start with Structured Streaming and IBM Streams
• Ability to publish/embed PixieApps into Web Application (Nodejs to
begin with)
• PixieDust visualization enhancements
– Custom colors
– Custom GeoJSON layers for maps
– Sorting/filtering
– More renderers: Brunel, ArcGIS, etc.
– …
• Ability to run Node.js code to load and visualize data
• Support for Jupyter Labs and Jupyter Hub
36. @DTAIEB55
As always…
We look forward for your feedback
and pull requests on GitHub
https://github.com/ibm-cds-labs/pixiedust
37. @DTAIEB55
CONCLUSION
• Solving the Data problems of tomorrow cannot be done by
data scientists alone.
• Notebooks, considered by most to be the domain of data
scientists, can help break down traditional silos and help
team of all types who are working on data problems
Try it for yourself today:
• IBM Data Science Experience
http://datascience.ibm.com/
• Locally using PixieDust automated installer
https://ibm-cds-labs.github.io/pixiedust/install.html