How can people collaborate over data analysis without disclosing their data to each other? This seminar will cover an end-to-end solution to this problem, including privacy-preserving entity resolution and the application of partially homomorphic encryption and Rademacher observations to private linear classification tasks.
In particular we will show that it is possible to learn from data, while keeping the data confidential, both with and without the entity resolution step. We will give a brief overview of potential applications and give some practical examples of how these approaches can be used.
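The seminar's full protocol relies on partially homomorphic encryption, but the entity-resolution step can be illustrated with a much simpler device: both parties blind their join keys with a keyed hash (HMAC) under a shared secret, exchange only the digests, and match on those. A minimal sketch follows; the key exchange and normalization rules here are assumptions for illustration, not the seminar's actual scheme:

```python
import hmac
import hashlib

def blind(identifier: str, key: bytes) -> str:
    """Blind an identifier with a keyed hash so the raw value is never shared."""
    normalized = identifier.strip().lower()   # both parties must normalize identically
    return hmac.new(key, normalized.encode(), hashlib.sha256).hexdigest()

def private_match(ids_a, ids_b, key):
    """Return party A's records whose blinded key also appears on party B's side."""
    digests_a = {blind(i, key): i for i in ids_a}
    digests_b = {blind(i, key) for i in ids_b}
    return [digests_a[d] for d in digests_a if d in digests_b]

key = b"shared-secret"  # agreed out of band, never sent with the data
matches = private_match(["Alice@x.com", "bob@y.org"], ["alice@x.com "], key)
```

Real systems harden this with Bloom-filter encodings or private set intersection, since a keyed hash alone still reveals which records matched.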
ADV Slides: Strategies for Fitting a Data Lake into a Modern Data Architecture – DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the Data Warehouse or to facilitate competitive Data Science and building algorithms in the organization, the Data Lake — a place for unmodeled and vast data — will be provisioned widely in 2019.
Though it doesn’t have to be complicated, the Data Lake has a few critical design points, and it does need to follow some principles for success. Build the Data Lake, not the Data Swamp! The tool ecosystem is building up around the Data Lake, and soon many organizations will have both a robust Lake and a Data Warehouse. We will discuss policies to keep them straight, send “horses to courses,” and keep up users’ confidence in the data platforms.
As for platform, although Hadoop hosted the early majority of Data Lakes, organizations are now concluding that the Data Lake should be built on Cloud object storage. We’ll discuss these options as well.
Get this data point for your Data Lake journey.
Data management best practices - infographic – Intellspot
Best practices for data management including data governance, data stewardship, data integration, data quality, and enterprise master data management best practices and strategies.
Emerging Trends in Data Architecture – What’s the Next Big Thing? – DATAVERSITY
With technological innovation and change occurring at an ever-increasing rate, it’s hard to keep track of what’s hype and what can provide practical value for your organization. Join this webinar to see the results of a recent DATAVERSITY survey on emerging trends in Data Architecture, along with practical commentary and advice from industry expert Donna Burbank.
A Beginner's Guide to Large Language Models – Ajitesh Kumar
Large Language Models (LLMs) are a type of deep learning model designed to process and understand vast amounts of natural language data. Built on neural network architectures, particularly the transformer architecture, LLMs have revolutionized the field of natural language processing. In this presentation, we will explore the world of LLMs, their significance, and the different types of LLMs based on the transformer architecture, such as autoregressive language models (e.g., GPT), autoencoding language models (e.g., BERT), and combined models (e.g., T5). Join us as we delve into the world of LLMs and discover their potential in shaping the future of natural language processing.
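The model families mentioned above differ mainly in their training objective. As a toy-scale illustration (counting statistics stand in for a neural network, and the corpus is invented), an autoregressive model predicts the next token from the left context only, while an autoencoding model recovers a masked token using context on both sides:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat".split()

# Autoregressive objective (GPT-style): model P(next token | previous token).
bigram = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev][nxt] += 1

def predict_next(prev):
    """Left-to-right prediction: only the preceding context is visible."""
    return bigram[prev].most_common(1)[0][0]

# Autoencoding objective (BERT-style): recover a masked token from both sides.
def predict_masked(left, right):
    """Score each candidate w by how often it bridges left -> w -> right."""
    scores = Counter()
    for w in bigram[left]:
        scores[w] += bigram[left][w] * bigram[w][right]
    return scores.most_common(1)[0][0] if scores else None
```

A T5-style encoder-decoder combines both ideas, conditioning on a corrupted input while generating its output left to right.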
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines – DATAVERSITY
With the aid of any number of data management and processing tools, data flows through multiple on-prem and cloud storage locations before it’s delivered to business users. As a result, IT teams — including IT Ops, DataOps, and DevOps — are often overwhelmed by the complexity of creating a reliable data pipeline that includes the automation and observability they require.
The answer to this widespread problem is a centralized data pipeline orchestration solution.
Join Stonebranch’s Scott Davis, Global Vice President, and Ravi Murugesan, Sr. Solution Engineer, to learn how DataOps teams orchestrate their end-to-end data pipelines with a platform approach to managing automation.
Key Learnings:
- Discover how to orchestrate data pipelines across a hybrid IT environment (on-prem and cloud)
- Find out how DataOps teams are empowered with event-based triggers for real-time data flow
- See examples of reports, dashboards, and proactive alerts designed to help you reliably keep data flowing through your business — with the observability you require
- Discover how to replace clunky legacy approaches to streaming data in a multi-cloud environment
- See what’s possible with the Stonebranch Universal Automation Center (UAC)
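The Universal Automation Center itself is proprietary, but the core idea of the webinar, a DAG of tasks released by event-based triggers rather than fixed schedules, can be sketched in a few lines of illustrative Python (the task names and trigger are invented for this example):

```python
from collections import deque

class Pipeline:
    """Tiny DAG orchestrator: a task runs once every upstream task has finished."""
    def __init__(self):
        self.tasks, self.deps = {}, {}

    def task(self, name, fn, depends_on=()):
        self.tasks[name] = fn
        self.deps[name] = set(depends_on)

    def run(self):
        """Execute all tasks in dependency order; return the order for audit logs."""
        done, order = set(), []
        ready = deque(n for n, d in self.deps.items() if not d)
        while ready:
            name = ready.popleft()
            self.tasks[name]()
            done.add(name)
            order.append(name)
            for n, d in self.deps.items():   # release tasks whose deps are all done
                if n not in done and n not in ready and d <= done:
                    ready.append(n)
        return order

log = []
p = Pipeline()
p.task("extract", lambda: log.append("extract"))
p.task("transform", lambda: log.append("transform"), depends_on=["extract"])
p.task("load", lambda: log.append("load"), depends_on=["transform"])
order = p.run()   # in practice run() would fire on an event, e.g. a file arriving
```

Production orchestrators add retries, observability hooks, and parallel execution on top of exactly this dependency-release loop.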
Unifying Space Mission Knowledge with NLP & Knowledge Graph – Vaticle
Synopsis
The number of space missions being designed and launched worldwide is growing exponentially. Information on these missions, such as their objectives, orbit, or payload, is disseminated across various documents and datasets. Facilitating access to this information is key to accelerating the design of future missions, enabling experts to link an application to a mission, and following various stakeholders' activities.
This presentation introduces recent research done at the ESA to combine the latest Language Models with Knowledge Graphs, unifying our knowledge on space missions. Language Models such as GPT-3 and BERT are trained to understand the patterns of human (natural) language. These models have revolutionised the field of NLP, the branch of AI enabling machines to understand human language in all its complexity. In this work, key information on a mission is parsed from documents with the GPT-3 model, and the parsed data is then migrated to a TypeDB Knowledge Graph to be easily queried. Although this work focuses on an application in the space sector, the method can be transferred to other engineering fields.
Presenters
Dr. Audrey Berquand is a Research Fellow at the ESA. Her research aims at enhancing space mission design and knowledge management with text mining, NLP, and Knowledge Graphs. She was awarded her PhD in 2021 by the University of Strathclyde (Scotland) for her thesis on “Text Mining and Natural Language Processing for the Early Stages of Space Mission Design”. Audrey has a background in space systems engineering: she holds an MSc in Aerospace Engineering from the Royal Institute of Technology KTH (Sweden) and a diplôme d'ingénieur from the EPF Graduate School of Engineering (France). Before diving into the world of AI, she spent three years at ESA involved in the early design phases of future Earth Observation missions.
Ana Victória Ladeira works with Knowledge Management at the ESA, using automated methods to exploit the information contained in the piles and piles of documents that ESA generates every day. With a Master's degree in Data Science from Maastricht University, Ana is particularly excited about how NLP methods can help large organizations connect different documents and highlight the bigger picture across a big universe of data sources, as well as about using Knowledge Graphs to connect people to the expertise and information they need.
What is Amazon OpenSearch Service?
OpenSearch is a distributed, open-source search and analytics suite used for real-time application monitoring, log analytics, and website search, among other things. With OpenSearch Dashboards, an integrated visualization tool that makes it easy for users to explore their data, OpenSearch provides a highly scalable system for fast access and response to large volumes of data. It is built on the Apache Lucene search library, which also powers Elasticsearch and Apache Solr. OpenSearch and OpenSearch Dashboards were created from forks of Elasticsearch 7.10.2 and Kibana 7.10.2, respectively. All software in the OpenSearch project is released under the Apache License, Version 2.0 (ALv2).
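Searches against OpenSearch are expressed as JSON in its query DSL. The sketch below builds a typical log-analytics request; the index name and field names are assumptions for illustration. The payload would be POSTed to the domain's `_search` endpoint, or passed to a client library such as opensearch-py:

```python
import json

# Illustrative query-DSL body: the last hour's log lines mentioning "error",
# newest first. Index name "app-logs" and the fields are assumed for the sketch.
search_body = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "error"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 20,
}

# This payload would be sent to POST https://<domain>/app-logs/_search,
# or via opensearch-py as client.search(index="app-logs", body=search_body).
payload = json.dumps(search_body)
```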
Customer-Centric Data Management for Better Customer Experiences – Informatica
With consumer and business buyer expectations growing exponentially, more businesses are competing on the basis of customer experience. But executing preferred customer experiences requires data about who your customers are today and what they will likely need in the future. Every business can benefit from an AI-powered master data management platform that supplies this information to line-of-business owners so they can execute great experiences at scale. The same need holds from an internal business process perspective: for example, many businesses require better data management practices to deliver preferred employee experiences. Informatica provides an MDM platform to solve for these examples and more.
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow – Lucas Arruda
Nowadays more and more companies are searching for insights with the potential to grow their business by analyzing large amounts of data from many different systems. However, to reach this level of Big Data analysis, it is necessary to build an ETL pipeline that processes raw data from different sources into a format that visualization tools such as Tableau can consume.
This kind of data processing can be done with a variety of tools, and in this presentation I show how to do it with a unified programming model created by Google and open-sourced as Apache Beam. We will build a simple pipeline and execute it in the cloud with Google Cloud Dataflow, a fully managed service.
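Beam pipelines are built from element-wise transforms, so the parsing step described above can be written as plain functions and only later wrapped in `beam.Map`. A minimal sketch (the CSV schema and bucket paths are invented, and running on Dataflow itself requires a GCP project):

```python
import csv
import io

def parse_sale(line: str) -> dict:
    """One ETL 'transform' step: raw CSV line -> typed record."""
    row = next(csv.reader(io.StringIO(line)))
    return {"product": row[0], "quantity": int(row[1]), "price": float(row[2])}

def add_revenue(rec: dict) -> dict:
    """Enrich the record with a derived field."""
    return {**rec, "revenue": rec["quantity"] * rec["price"]}

records = [add_revenue(parse_sale(l)) for l in ["widget,3,2.50", "gadget,1,9.99"]]

# In Apache Beam the same functions would be wired into a pipeline, roughly:
#   (p | beam.io.ReadFromText("gs://bucket/sales.csv")
#      | beam.Map(parse_sale)
#      | beam.Map(add_revenue)
#      | beam.io.WriteToText("gs://bucket/out"))
# and executed with the DataflowRunner on Google Cloud Dataflow.
```

Keeping the transforms as ordinary functions makes them unit-testable without any runner, which is the usual way to develop Beam code locally before deploying.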
ApacheCon Europe Big Data 2016 – Parquet in practice & detail – Uwe Korn
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push analytical queries down to the I/O layer to avoid loading non-relevant data chunks. With Java and C++ implementations, Parquet is also a perfect choice for exchanging data between different technology stacks.
As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan... – HostedbyConfluent
Organizations have been chasing the dream of data democratization, unlocking and accessing data at scale to serve their customers and business, for over half a century, since the early days of data warehousing. They have tried to reach this dream through multiple generations of architectures, such as the data warehouse and the data lake, through a Cambrian explosion of tools, and through large investments to build their next data platform. Despite the intentions and the investments, the results have been middling.
In this keynote, Zhamak shares her observations on the failure modes of a centralized paradigm of a data lake, and its predecessor data warehouse.
She introduces Data Mesh, a paradigm shift in big data management that draws from modern distributed architecture: considering domains as the first class concern, applying self-sovereignty to distribute the ownership of data, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
This talk introduces the principles underpinning data mesh and Zhamak's recent learnings in creating a path to bring data mesh to life in your organization.
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you, or should you, consider data mesh as the approach for your analytics platform? And most importantly, how can Snowflake help?
Given in Montreal on 14-Dec-2021
Google Cloud Platform Tutorial | GCP Fundamentals | Edureka – Edureka!
( Google Cloud Certification Training - Cloud Architect: https://www.edureka.co/google-cloud-a... ) This tutorial on Google Cloud Platform will provide you a detailed introduction to GCP and its cloud services. Learn why GCP is preferred over other cloud providers, and also learn about the various zones and regions where the servers are hosted.
Data Catalog as the Platform for Data Intelligence – Alation
Data catalogs are in wide use today across hundreds of enterprises as a means to help data scientists and business analysts find and collaboratively analyze data. Over the past several years, customers have increasingly used data catalogs in applications beyond their search & discovery roots, addressing new use cases such as data governance, cloud data migration, and digital transformation. In this session, the founder and CEO of Alation will discuss the evolution of the data catalog, the many ways in which data catalogs are being used today, the importance of machine learning in data catalogs, and discuss the future of the data catalog as a platform for a broad range of data intelligence solutions.
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Data Strategy - Executive MBA Class, IE Business School – Gam Dias
For today's enterprise, data is now very much a corporate asset, vital to delivering products and services efficiently and cost-effectively. Few organizations can survive without harnessing data in some way.
Viewed as a strategic asset, data can be a source of new internal efficiencies, improved competitive advantage or a source of entirely new products that can be targeted at your existing or new customers.
This slide deck contains the highlights of a one day course on Data Strategy taught as part of the Executive MBA Program at IE Business School in Madrid.
Learn to Use Databricks for the Full ML Lifecycle – Databricks
Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. In this talk, learn how to operationalize ML across the full lifecycle with Databricks Machine Learning.
Describes what Enterprise Data Architecture in a software development organization should cover, by listing over 200 data-architecture-related deliverables an Enterprise Data Architect should remember to evangelize.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
Amundsen: From discovering data to securing data – markgrover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
The extent and impact of recent security breaches shows that current security approaches are simply not working. But what can we do to protect our business? We have long advocated monitoring as a way to detect subtle, advanced attacks that still make it through our defenses. However, products have failed to deliver on this promise.
Current solutions don't scale in both data volume and analytical insights. In this presentation we will explore what security monitoring is. Specifically, we are going to explore the question of how to visualize a billion log records. A number of security visualization examples will illustrate some of the challenges with big data visualization. They will also help illustrate how data mining and user experience design help us get a handle on the security visualization challenges - enabling us to gain deep insight for a number of security use-cases.
This session took place in New York City on November 4th, 2019.
Speaker Bio:
Chemere is a Senior Data Science Training Specialist for H2O.ai. Chemere has a Master's in Business Administration with a focus in Marketing Analytics from the University of North Carolina at Charlotte. She is an experienced data scientist with a diverse background in transformational decision-making across industries including Banking, Manufacturing, Logistics, and Medical Devices. Chemere joins us from Venus Concept/2two5, where she was the Lead Data Scientist focused on building predictive models with Internet of Things (IoT) data and for a subscription-based marketing product for B2B customers. Prior to that, Chemere worked as a Senior Data Scientist at Wells Fargo Bank, focused on various applied predictive analytics solutions.
More details about the event are available here: https://www.eventbrite.com/e/dive-into-h2o-new-york-tickets-76351721053
Speaker: Philippe Mizrahi - Associate Product Manager - Lyft
Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.
During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.
How to teach your data scientist to leverage an analytics cluster with Presto...Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio
Katarzyna Orzechowska, Data Scientist (ING Tech)
Mariusz Derela, DevOps Engineer (ING Tech)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
Watch: https://bit.ly/2DYsUhD
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- How Prologis accelerated their use of Machine Learning with data virtualization
How do we protect privacy of users in large-scale systems? How do we ensure fairness and transparency when developing machine learned models? With the ongoing explosive growth of AI/ML models and systems, these are some of the ethical and legal challenges encountered by researchers and practitioners alike. In this talk (presented at QConSF 2018), we first present an overview of privacy breaches as well as algorithmic bias / discrimination issues observed in the Internet industry over the last few years and the lessons learned, key regulations and laws, and evolution of techniques for achieving privacy and fairness in data-driven systems. We motivate the need for adopting a "privacy and fairness by design" approach when developing data-driven AI/ML models and systems for different consumer and enterprise applications. We also focus on the application of privacy-preserving data mining and fairness-aware machine learning techniques in practice, by presenting case studies spanning different LinkedIn applications, and conclude with the key takeaways and open challenges.
Sumo Logic QuickStart Webinar - Get CertifiedSumo Logic
Video: https://www.sumologic.com/online-training/#start
Brand new to Sumo Logic?
Get started with these 5 easy steps. Learn how to capitalize on critical capabilities that can amplify your log analytics and monitoring experience while providing you with meaningful business and IT insights.
Smart Solutions: Data Analytics to Support Fraud Examinationscorma GmbH
This is an updated slide set based on my ACFE presentation in 2011. The idea is to present a concept to use Data Analytics in Fraud Investigations. For more information feel free to contact me via www.corma.de.
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
Automate your Data Science pipeline with Ansible, Python and Kubernetes - ODSC Talk
What is Data Science and the Data Science Landscape
Process and Flow
Understanding Data
The Data Science Toolkit
The Big Data Challenge
Cloud Computing Solutions
The rise of DevOps in Data Science
Automate your data pipeline with Ansible
AI & ML in Cyber Security - Why Algorithms Are DangerousRaffael Marty
Every single security company is talking in some way or another about how they are applying machine learning. Companies go out of their way to make sure they mention machine learning and not statistics when they explain how they work. Recently, that's not enough anymore either. As a security company you have to claim artificial intelligence to be even part of the conversation.
Guess what. It's all baloney. We have entered a state in cyber security that is, in fact, dangerous. We are blindly relying on algorithms to do the right thing. We are letting deep learning algorithms detect anomalies in our data without having a clue what that algorithm just did. In academia, they call this the lack of explainability and verifiability. But rather than building systems with actual security knowledge, companies are using algorithms that nobody understands and in turn discover wrong insights.
In this talk I will show the limitations of machine learning, outline the issues of explainability, and show where deep learning should never be applied. I will show examples of how the blind application of algorithms (including deep learning) actually leads to wrong results. Algorithms are dangerous. We need to return to experts and invest in systems that learn from, and absorb the knowledge of, experts.
Similar to Confidential Computing - Analysing Data Without Seeing Data
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can then be calculated directly. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
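One of the work-reduction ideas above, skipping vertices whose ranks have already converged, can be sketched as follows. This is a simplified illustration under my own assumptions (a vertex is frozen permanently once its change drops below `eps`, and the graph has no dead ends); it is not the STICD algorithm itself, and the function name and graph are hypothetical.

```python
def pagerank_skip(out_links, d=0.85, eps=1e-10, max_iter=200):
    # Power iteration that stops re-computing a vertex once its rank change
    # drops below eps -- a simplified "skip converged vertices" optimisation.
    # Assumes no dead ends: every vertex has at least one out-link.
    verts = list(out_links)
    rank = {v: 1.0 / len(verts) for v in verts}
    in_links = {v: [] for v in verts}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new = {}
        for v in verts:
            if v in converged:
                new[v] = rank[v]        # skip: keep the frozen rank
                continue
            share = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new[v] = (1 - d) / len(verts) + d * share
            if abs(new[v] - rank[v]) < eps:
                converged.add(v)
        rank = new
        if len(converged) == len(verts):
            break
    return rank

# On a symmetric 3-cycle every vertex settles at rank 1/3 immediately.
ranks = pagerank_skip({'a': ['b'], 'b': ['c'], 'c': ['a']})
```

Freezing a vertex permanently trades a little accuracy for work saved; production implementations typically re-activate a vertex if its in-neighbours change significantly.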
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, are commonly implemented over Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
5. Future Value of Data
Data Analytics Without Seeing the Data
[Chart: the value of data over time after release, showing data decay. Joining new data and new analytics techniques add uncertain future value and unknown future risk.]
21. Paillier Encryption
Encryption of m:  c = g^m · r^n mod n²
Addition of encrypted numbers:  D( E(m1) · E(m2) mod n² ) = m1 + m2 mod n
Multiplication of an encrypted number by a scalar:  D( E(m1)^m2 mod n² ) = m1 · m2 mod n
22. Paillier Encryption
Encryption of m:  c = g^m · r^n mod n²
Addition of encrypted numbers:  g^m1 × g^m2 = g^(m1 + m2)
Multiplication of an encrypted number by a scalar:  (g^m1)^m2 = g^(m1 · m2)
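The two Paillier slides above can be sketched directly in Python. This is a toy illustration under my own assumptions (the tiny primes, function names, and parameter choices are mine, not from the deck); real deployments use primes of 2048+ bits and a vetted library.

```python
import math
import random

def keygen(p=293, q=433):
    # Toy primes for illustration only -- real Paillier needs 2048+ bit primes.
    n = p * q
    n2 = n * n
    g = n + 1                                            # standard simple choice of g
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
    # mu = (L(g^lam mod n^2))^(-1) mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)                  # coprime to n with high probability
    return (pow(g, m, n2) * pow(r, n, n2)) % n2          # c = g^m * r^n mod n^2

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n         # m = L(c^lam mod n^2) * mu mod n

pub, priv = keygen()
n, _ = pub
c1, c2 = encrypt(pub, 42), encrypt(pub, 58)
assert decrypt(pub, priv, (c1 * c2) % (n * n)) == 100    # addition under encryption
assert decrypt(pub, priv, pow(c1, 3, n * n)) == 126      # scalar multiplication
```

Multiplying ciphertexts adds the plaintexts, and raising a ciphertext to a scalar multiplies the plaintext by it, exactly the two properties shown on the slides.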
28. Logistic Regression
Logistic function:  p(x; θ) = 1 / (1 + e^(−θ·x))
Log likelihood:  L(θ) = Σ_{i=0..n} [ y_i log p(x_i; θ) + (1 − y_i) log(1 − p(x_i; θ)) ]
Maximise L(θ) for θ (equivalently, minimise −L(θ)), then evaluate p(x; θ).
Requires "secure log" and "secure inverse" protocols using Paillier encryption.
Builds on Han et al. 2010, "Privacy Preserving Gradient Descent Methods"
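A plaintext version of this estimator can be sketched as follows. In the private setting described above, the gradient would instead be evaluated under Paillier encryption with the secure log / secure inverse sub-protocols; the function names and the tiny dataset here are illustrative, not from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    # Plain gradient ascent on the log likelihood L(theta).  The private
    # variant computes this same gradient over encrypted values.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (y - sigmoid(X @ theta))   # gradient of L(theta)
        theta += lr * grad / len(y)
    return theta

# Tiny illustrative dataset: first column is a bias term.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([0., 0., 1., 1.])
theta = fit_logistic(X, y)
```

Since the log likelihood is concave, plain gradient ascent converges; the expense of the private version comes entirely from performing the same arithmetic homomorphically.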
30. Performance
• Learning: learnt models have the same accuracy as unencrypted calculations, but "private learning" is roughly 1000x slower due to the encrypted computations; learning times are several hours.
• Deployment: a score can be generated in real time (<50 ms), and the customer data that contributes to the score remains private.
31. Scaling
[Architecture diagram: a central Coordinator distributes the encrypted computation across pools of Workers attached to Data Provider 1 and Data Provider 2.]
[Plot: learning time in minutes (log scale, roughly 5 to 500) versus number of cores (0–400), for datasets of 10,000×10, 100,000×10, and 1,000,000×10 features.]
36. Private Record Linkage
Organisation A and Organisation B each hold PII data:

A's PII data                           B's PII data
Name           DOB       Gender       Name            DOB       Gender
John Smith     12/01/82  M            Mark Gorgon     1/12/90   M
Mark Gorgon    1/12/90   M            Juliet Baker    2/11/72   F
Hanna Smith    4/02/78   F            Andrew Roberts  4/02/93   M
…              …         …            …               …         …
Juliet Baker   2/11/72   F            Hanna Smith     4/02/78   F

Using a shared secret salt, each organisation hashes its own records locally:

A's cryptographic hashes              B's cryptographic hashes
Row      Key                          Row      Key
1        10110110...00101010          1        01110110...11010101
2        01110110...11010101          2        01101011...00101101
3        10011001...10100110          3        01111000...00110011
…        …                            …        …
100000   01101011...00101101          100000   10011101...10100111

A fuzzy matcher at N1 Analytics compares the hashes and produces a linkage table (X = no match):

Linkage table
Row A    Row B
1        X
2        1
3        100000
…        …
100000   X

Similar in approach to MERLIN - Ranbaduge, Vatsalan, Christen (2015)
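A much-simplified, exact-match version of the salted-hash step can be sketched as follows. The deck's fuzzy matcher instead compares similarity-preserving encodings (as in MERLIN), so this is only an illustration of the shared-salt idea; `record_key` and the salt value are hypothetical names of my own.

```python
import hashlib
import hmac

def record_key(record, salt):
    # Hypothetical helper: each organisation applies the same keyed hash
    # (HMAC-SHA256) to its normalised PII fields under the shared secret
    # salt, so only hashes -- never the PII itself -- leave the organisation.
    msg = "|".join(field.strip().lower() for field in record).encode()
    return hmac.new(salt, msg, hashlib.sha256).hexdigest()

salt = b"shared-secret-salt"            # illustrative value only
key_a = record_key(("Mark Gorgon", "1/12/90", "M"), salt)
key_b = record_key(("Mark Gorgon", "1/12/90", "M"), salt)
assert key_a == key_b                   # identical records link by hash alone
```

Exact hashing only links records that agree on every field after normalisation; handling typos and format differences ("Mark/Gorgon" vs "Mark.Gorgon") is what requires the fuzzy, similarity-preserving encodings used in the deck.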
41. Current Capabilities of the N1 Platform
• Standard data analytics techniques on confidential data:
  • Correlation analysis
  • Classification / prediction
  • Regression
  • Clustering / outlier detection
• Automated private record linkage
• Fine-grained authorisation and access control
[Diagram: federated model with no central database. Data is kept local to each source (Dept 1, Org 2, Comp 3), connected by private record linkage and private analytics: statistics, classifiers, anomaly detection.]
43. Acknowledgements
Engineering: Mr. Brian Thorne, Dr. Mentari Djatmiko, Dr. Guillaume Smith, Dr. Wilko Hanecka, Dr. Hamish Ivey-Law
Research: Dr. Richard Nock, Mr. Giorgio Patrini, Dr. Roksana Borelli, Dr. Arik Friedman, Prof. Hugh Durrant-Whyte
Business: Mr. Warren Bradey, Ms. Shelley Copsey
Lead: Dr. Stephen Hardy