During the development of a machine learning model, about 80% of the time is spent on data preparation and on resolving data quality issues, especially when data must be combined from structured and unstructured sources. Developing a smart generic data mart can reduce the go-to-production time for new ML models. We will share creative solutions to challenges we encountered during data transfer between the DWH and the Data Lake, as well as in data preprocessing and in the development, deployment, and orchestration of ML models using Python/PySpark scripts.
DevOps and Machine Learning (Geekwire Cloud Tech Summit)Jasjeet Thind
DevOps and Machine Learning: How do you test and deploy real-time machine learning services, given the challenge that machine learning algorithms produce nondeterministic behaviors even for the same input?
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
Michael will present an overview of Elastic's machine learning capabilities.
As we know, data science work can be messy, fractured, and challenging as data volumes increase. This session will explore how the Elastic stack can offer a single destination for data ingestion and exploration, time series modeling, and communication of results through data visualizations by focusing on a few sample data sources.
We will also explore new functionality offered by Elastic machine learning, in particular an integration with our APM solution.
Trained as a mathematician, Michael Hirsch started his career with no development experience. His first task: "model the world in a relational database." Over the last 7 years Michael has established himself as a data scientist, with a focus on building end-to-end systems. In his career, he has built machine learning powered platforms for clients including Nike, Samsung, and Marvel, and approaches his work with the idea that machine learning is only as useful as the interfaces that users interact with.
Currently, Michael is a Product Engineer for Machine Learning at Elastic. He focuses on tailoring Elastic's ML offering to customer use cases, as well as integrating machine learning capabilities across the entire Elastic Stack.
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
Do you know The Cloud Girl? She makes the cloud come alive with pictures and storytelling.
The Cloud Girl, Priyanka Vergadia, Chief Content Officer @Google, joins us to tell us about Scaleable Data Analytics in Google Cloud.
Maybe, with her explanation, we'll finally understand it!
Priyanka is a technical storyteller and content creator who has created over 300 videos, articles, podcasts, courses and tutorials which help developers learn Google Cloud fundamentals, solve their business challenges and pass certifications! Check out her content on the Google Cloud Tech YouTube channel.
Priyanka enjoys drawing and painting which she tries to bring to her advocacy.
Check out her website The Cloud Girl: https://thecloudgirl.dev/ and her new book: https://www.amazon.com/Visualizing-Google-Cloud-Illustrated-References/dp/1119816327
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Databricks
At each drilling site, thousands of different pieces of equipment operate simultaneously, 24/7. For the oil & gas industry, downtime can cost millions of dollars daily. As current standard practice, the majority of the equipment is on scheduled maintenance, with standby units to reduce downtime.
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
We will present our O365 use case scenarios, why we chose Cassandra + Spark, and walk through the architecture we chose for running DataStax Enterprise on azure.
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPaige_Roberts
ODSC East virtual presentation - The best machine learning and advanced analytics projects are often stopped when it comes time to move into large-scale production, preventing them from ever impacting the business in a meaningful way. Hundreds of hours of work may never get put to use.
Python is rapidly becoming the language of choice for scientists and researchers of many types to build, test, train and score models. But when data science models need to go into production, challenges of performance and scale can be a huge roadblock.
By combining a Python application with an underlying massively parallel (MPP) database, Python users can achieve a simplified path to production. An MPP database also allows you to do data preparation and data analysis at far greater speeds, accelerating development and testing as well as production performance. It also allows greater numbers of concurrent jobs to run, while also continuously loading data for IoT or other streaming use cases.
Analyze data in the database where it sits, rather than first moving it to another framework, then analyzing it, then moving the results, taking multiple performance hits from both CPU and IO for every move and transformation.
In this talk, you will learn about combination architectures that can get your work into production, shorten development time, and provide the performance and scale advantages of an MPP database with the convenience and power of Python. Use case examples use the open source Vertica-Python project created by Uber with contributions from Twitter, Palantir, Etsy, Vertica, Kayak and Gooddata.
• Associate Consultant pursuing an Executive MBA, with 3+ years of experience in the Healthcare and Banking domains and in software development and implementation, covering data warehousing with the IBM WebSphere DataStage 8.1 and IBM InfoSphere DataStage 8.7 tools, ETL architecture, enhancement, maintenance, production support, data modeling, data profiling, and reporting, including business and system requirements gathering.
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
Dr. Pouria Amirian explains data science and the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in a data science project and solutions to them.
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
Dr. Pouria Amirian from the University of Oxford explains Data Science and its relationship with Big Data and Cloud Computing. He then illustrates using AzureML to perform simple data science analytics.
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessAnant Corporation
In Data Engineer's Lunch #60, Rahul Singh, CEO of Anant, will discuss modern data processing/pipeline approaches.
Want to learn about modern data engineering patterns & practices for global data platforms? A high-level overview of different types, frameworks, and workflows in data processing and pipeline design.
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Joachim Schlosser
In a society where the collection of personal data has become commonplace, it is hardly surprising that the innovative machine builder also collects data wherever it can. Product data, machine data, statistical data: an average production plant already generates gigabytes of data every day. "Big Data" has become one of the buzzwords of Industry 4.0.
But what do we expect to gain from it? What information is hidden in the recorded machine and product data? And how is the analysis performed?
The talk shows how companies can develop, test, and roll out their analysis algorithms on an established platform such as MATLAB®. The continuous analysis itself then runs either on a plant server or in real time directly on the machine. This is illustrated with examples from practice.
But besides the collected data, the control units in production also gain greater importance in Industry 4.0.
When workpieces soon know for themselves where they want to go in the production flow and which processing step they should receive, this also means more functionality for the individual components and modules in production and logistics, since they must react to these inputs as well.
How do you ensure that this additional functionality does not come at the expense of the energy balance? How do you run the motors and other active components of your production so that they react flexibly to changed workpiece routes and still operate in the optimal range?
More than ever, you need controlled and regulated components and modules. This should have been in place since Industry 3.0, but there is still very concrete potential here to increase productivity and to save energy and production time.
In the talk you will see how to control your components better, so that the networked, dynamic requirements of Industry 4.0 can be implemented efficiently at the local level.
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...Deepak Chandramouli
PayPal Data Lake Journey | 2017-Oct | San Diego | Teradata Edge of Next
Gimel [http://www.gimel.io] is a Big Data Processing Library, open sourced by PayPal.
https://www.youtube.com/watch?v=52PdNno_9cU&t=3s
Gimel empowers analysts, scientists, and data engineers alike to access a variety of Big Data / traditional data stores with just SQL or a single line of code (the Unified Data API).
This is possible via a catalog of technical properties abstracted from users, along with a rich collection of data store connectors available in the Gimel library.
A catalog provider can be Hive, User Supplied (runtime), or UDC.
In addition, PayPal recently open-sourced UDC (Unified Data Catalog), which can host and serve the technical metadata of data stores and objects. Visit http://www.unifieddatacatalog.io to experience it first-hand.
[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdfDataScienceConferenc1
In this talk, I'll journey from my time as a Research Assistant at the Bernoulli Institute, delving into the classification of neurodegenerative diseases, to my encounters with groundbreaking biotechnology and AI companies like Proteinea, AlProtein, Rology, and Natrify in Egypt. These innovative ventures are reshaping industries from their Egyptian hub. Join me as I illuminate the transformative power of this thriving ecosystem, showcasing Egypt's remarkable strides in biotech and AI on the global stage.
Building a big-scale data product doesn't rely only on sophisticated modeling. It also requires an agile methodology, an iterative research & development process, a versatile big data stack, and a value-oriented mindset. I'll discuss how we, at Dsquares, build a big-scale AI product that leverages clients' data from different industries to deliver business-critical value to the end customer. I'll cover the process of product discovery, R&D tasks for unsolved problems, and mapping business requirements into big data technical requirements.
[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptxDataScienceConferenc1
Innovation thrives at the intersection of data and creativity. While brainstorming has traditionally fueled the generation of new ideas, leveraging data alongside creative techniques empowers organizations to develop more effective and impactful innovations.
[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...DataScienceConferenc1
In today's fast-paced and competitive business environment, harnessing the power of data is essential for staying ahead. Building a data-driven culture within an organization is not just a strategic advantage, but a necessity for those who wish to thrive and innovate. In this insightful talk, our esteemed speaker, a Chief Data Scientist with a decade of experience in the financial services sector, will unravel the complexities of embedding data into the DNA of your organization. The speaker will explore the key tenets of establishing a data-centric mindset, the importance of executive support, and the need for enhancing data literacy across the company. Practical solutions and real-world examples will be provided, demonstrating how to overcome obstacles and successfully integrate a data-driven approach. Attendees will learn strategies for empowering every team member to use data effectively and how to leverage technology to facilitate this cultural shift. The session promises to be a guide for those looking to champion data within their organizations, offering actionable insights for transformation.
[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdfDataScienceConferenc1
The use of Artificial Intelligence (AI) is rapidly transforming the recruitment landscape. This talk explores the various ways AI is being used in hiring, from candidate sourcing and screening to skills assessments and interview preparation. We'll discuss the benefits of AI, such as increased efficiency and reduced bias, but also address potential drawbacks like ethical considerations and the human touch.
[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...DataScienceConferenc1
In today's business landscape, data strategy plays a pivotal role in driving innovation within business models. This talk explores how organizations can leverage data effectively to transform their operations, products, and services.
[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...DataScienceConferenc1
Delve into the unexplored potential of scene graphs in the realms of Generative AI and innovative data product development. This session unveils the intricate role of scene graphs in generating realistic content and driving advancements in computer vision, and automated content creation. Join us for a journey into the intersection of scene graphs and cutting-edge AI, gaining insights into their pivotal role in reshaping the landscape of data-centric innovation. This talk is your gateway to understanding how structured visual representations are shaping the future of AI and revolutionizing the creation of data-driven solutions.
This presentation will delve into the transformative role of Artificial Intelligence in reshaping social media landscapes. We'll explore cutting-edge AI technologies that are integrating with social media platforms, altering how we interact, consume content, and perceive digital communities. The talk will also cast a visionary eye towards future trends, discussing potential impacts on user experience, content creation, digital marketing, and privacy concerns. Join us to uncover how AI is not just a tool but a game-changer in the evolving narrative of social media.
Supercharge your software development with Azure OpenAI Service! Azure cloud platform provides access to cutting-edge AI models for diverse tasks. Explore different models for generating content, translating languages, and even generating code. Leverage data grounding to fine-tune models for your specific needs. Discover how Azure OpenAI Service accelerates innovation and injects intelligence into your software creations.
[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...DataScienceConferenc1
In this insightful talk, we'll embark on a journey from the origins of programming in 1883 and the conceptualization of AI in the 1950s, to the current explosion of AI applications reshaping our world. We'll unravel why AI has surged to prominence in the last decade, driven by unprecedented data generation and significant hardware advancements. With examples ranging from individual email filtering to complex supply chain optimizations, we'll explore AI's pervasive impact across various sectors including finance, manufacturing, healthcare, and media. The talk will address the challenges of AI implementation, such as the high cost of AI teams and the quest for universally applicable models, while highlighting the promising horizon of no-code AI platforms democratizing access. Furthermore, we'll delve into the ethical dimensions of AI, from biases to privacy concerns, and the pressing question of AI's potential to replace human roles. Lastly, we'll discuss the transformative potential of language models and generative AI, underscoring the importance of understanding and integrating AI into our lives and businesses for a future that's both scalable and sustainable.
[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...DataScienceConferenc1
Transitioning to a career in data science requires careful planning and smart choices. In this session, I'll help you understand how to switch to data science. Using my own experiences and what I've learned from the industry, we'll break down the important steps for a successful transition. We'll cover everything from figuring out which skills you can carry over to learning the technical stuff and connecting with other professionals. By the end, you'll have the knowledge and tools you need to start your journey into data science, whether you're a seasoned professional looking for something new or just starting out in the field.
[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...DataScienceConferenc1
With the continuous growth of the digital environment, the risks in the online realm also increase. This calls for strong security measures to safeguard valuable information and essential systems. Artificial Intelligence (AI) has become a powerful weapon in the fight against cyber threats. This talk presents a thorough examination of the most recent algorithms and applications of artificial intelligence in the field of cybersecurity.
[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptxDataScienceConferenc1
What is Generative AI and how does it work? Could it eventually replace us? Let's delve deep into the heart of this groundbreaking technology and uncover the truths and myths surrounding Generative AI and how to make the most of it.
Background: The digital twin paradigm holds great promise for healthcare, most importantly for efficiently integrating many disparate healthcare data sources and servicing complex tasks like personalizing care, predicting health outcomes, and planning patient care, even though many technical and scientific challenges remain to be overcome. Objective: As part of the QUALITOP project, we conducted a comprehensive analysis of diverse healthcare data, encompassing both prospective and retrospective datasets, along with an in-depth examination of the advanced analytical needs of medical institutions across five European Union countries. Through these endeavors, we have systematically developed and refined a formal Personal Medical Digital Twin (PMDT) model subjected to iterative validation by medical institutions to ensure its applicability, efficacy, and utility. Findings: The PMDT is based on an interconnected set of expressive knowledge structures that are calibrated to capture an individual patient's psychosomatic, cognitive, biometrical and genetic information in one personal digital footprint, in a manner that allows medical professionals to run various models to predict an individual's health issues over time and intervene early with personalized preventive care. Conclusion: At the forefront of digital transformation, the PMDT emerges as a pivotal entity, positioned at the convergence of Big Data and Artificial Intelligence. This paper introduces a PMDT environment that lays the foundation for the application of comprehensive big data analytics, continuous monitoring, cognitive simulations, and AI techniques. By integrating stakeholders across the care continuum, including patients, this system enables the derivation of insights and facilitates informed decision-making for personalized preventive care.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
[DSC Europe 22] Smart approach in development and deployment process for various ML models - Danijel Ilievski & Milos Josifovic
1. Smart approach in development and deployment process for various ML models
Jelena Pekez (Advanced Analytics Team Lead)
Miloš Josifović (Big Data Architect)
Danijel Ilievski (Senior ML Engineer)
2. Comtrade System Integration
Introduction
→Since 87% of models are never deployed, all steps should be planned at the beginning of the Data Science Lifecycle (pipeline):
1. Manage
2. Develop
3. Deploy
4. Monitor
→The first goal is to reduce go-to-production time for new ML models with the development of Smart Generic Data Mart(s).
→With Smart Data Mart(s) we can prototype an ML model and evaluate feasibility.
→The final goal is to generate Production Models and easily orchestrate them.
[Slide diagram: the Data Science lifecycle, from Problem Formulation and Data Mart Design through Data Preprocessing (ADS), Modeling, Results Interpretation, and Deployment to the Production Model]
3. Comtrade System Integration
ADS smart development to support all future ML models
→Planning the Data Mart for the creation of the first ML model in a program takes exhaustive time:
• Collect at high-level all possible future use-cases
• Come up with all relevant and available data sources
• Customer activities the company is interested in
• Combine data from structured and unstructured data sources
• Extensive feature engineering (text processing, normalization, binning,…)
• Complying with GDPR regulation
• Define proper access rights on selected Data Mart(s)
• Resolving data quality issues at the very beginning will reduce endless reloads
For the next ML model, data scientists can spend more time on creative activities, using the developed Analytical Data Marts/Sets (ADS)
4. Comtrade System Integration
Smart generic data mart(s)
→Creating Multipurpose Data Marts:
• Generate a list of target features and relevant target events
• Design it so new events can be easily added
• Eliminate data that have no business/use-case value
• Filter out system records - clean data
• Make the initial (starting) base table(s): what is the definition of a customer?
• Aggregate data to different granularity levels to catch behavior trends
• Feature Engineering does indeed make a difference!
Quickly and easily generate new ML training datasets (a minimal aggregation sketch follows below)
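Below is a minimal PySpark sketch of the "aggregate to different granularity levels" idea, assuming a hypothetical event-level table ads.ds_payment with customer_id, time_id, and total_payment_amt columns; the table and column names are illustrative, not taken from the deck.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ads_granularity").getOrCreate()

# Hypothetical event-level source table in the ADS schema.
events = spark.table("ads.ds_payment")

# Monthly granularity per customer: catches short-term behaviour trends.
monthly = (events
           .groupBy("customer_id", F.date_trunc("month", "time_id").alias("month_id"))
           .agg(F.sum("total_payment_amt").alias("total_payment_amt_1m"),
                F.count("*").alias("total_payment_cnt_1m")))

# Quarterly granularity: a smoother, longer-term view of the same behaviour.
quarterly = (events
             .groupBy("customer_id", F.date_trunc("quarter", "time_id").alias("quarter_id"))
             .agg(F.avg("total_payment_amt").alias("avg_payment_amt_3m")))

monthly.write.mode("overwrite").saveAsTable("ads.ds_payment_monthly")
quarterly.write.mode("overwrite").saveAsTable("ads.ds_payment_quarterly")

Training datasets for new models can then be assembled by joining these pre-aggregated tables rather than recomputing features from raw events each time.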
5. Comtrade System Integration
Data Science requires domain knowledge, and it makes a big difference
→How much domain knowledge do I need? It depends.
→Domain knowledge is critical for data preparation, productization and orchestration
→Which data points add value?
→Domain knowledge is necessary in data pre-processing:
• Outlier detection, feature importance, model selection, model evaluation stage...
[Slide diagram: Venn diagram placing Data Science at the intersection of Domain Knowledge, Math/Stats & ML, and Computer Science]
You have to get the best of both worlds!
6. Comtrade System Integration
Control your data mart(s) in production
→Steps in the data pipeline for data quality checks (a minimal PySpark sketch follows after this list):
• Missing data vs Loaded data - aggregations
• Duplicates – the same records were repeated
• Relative change threshold - increment or decrement in the number of records
• Statistical expected range
• Data drift – target variable distribution
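The checks above can be scripted directly into the pipeline. A minimal PySpark sketch, assuming a hypothetical ads.ds_payment table keyed by (customer_id, time_id); the thresholds and reference figures are illustrative assumptions, not project values.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
df = spark.table("ads.ds_payment")

# 1. Missing data vs. loaded data: basic row-count aggregation.
loaded_rows = df.count()
assert loaded_rows > 0, "No data loaded"

# 2. Duplicates: the same records repeated on the defined key.
dup_rows = (df.groupBy("customer_id", "time_id")
              .count()
              .filter(F.col("count") > 1)
              .count())
assert dup_rows == 0, f"{dup_rows} duplicated keys found"

# 3. Relative change threshold: increment/decrement in the number of records.
previous_rows = 980_000            # illustrative; in practice read from a load-audit table
relative_change = abs(loaded_rows - previous_rows) / previous_rows
assert relative_change < 0.10, f"Row count changed by {relative_change:.1%}"

# 4. Statistical expected range on a numeric feature.
stats = df.agg(F.mean("total_payment_amt").alias("mean")).collect()[0]
assert 0 < stats["mean"] < 10_000, "Mean payment amount outside expected range"

# 5. Data drift: crude check on the target variable distribution.
target_rate = df.agg(F.mean("target_flag")).collect()[0][0]
reference_rate = 0.05              # illustrative reference from training time
assert abs(target_rate - reference_rate) < 0.02, "Target distribution drifted"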
7. Comtrade System Integration
Example of how a Generic Data Set can help to focus on Data Science – transfer between DWH and Data Lake
→Data on two platforms (DWH – SQL database, Data Lake – Hadoop)
→Data can be transferred among databases:
• Through SQL federation / DB link – subject to product-specific compatibility
• Via Spark engine (PySpark) to Hadoop
→The aim is to simplify data transfer between platforms so that data scientists can do it on their own (see the sketch at the end of this slide), without:
• Dealing with Spark jobs directly
• Managing Hadoop security (Kerberos, read-write permissions, etc.)
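A sketch of what such a transfer helper could look like, assuming a JDBC-reachable DWH and a Hive-backed Data Lake; the JDBC URL, credentials, and table names below are placeholders, not the project's actual configuration.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dwh_to_lake")
         .enableHiveSupport()
         .getOrCreate())

def copy_dwh_table(dwh_table: str, lake_table: str) -> None:
    """Copy one DWH table to the Data Lake so the data scientist does not
    have to touch Spark jobs or Hadoop security directly."""
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dwh-host:1521/DWH")   # placeholder URL
          .option("dbtable", dwh_table)
          .option("user", "ads_reader")                             # placeholder credentials
          .option("password", "***")
          .option("fetchsize", 10000)
          .load())
    # Write as a managed Hive table; Kerberos and permissions are handled by the
    # service account running this wrapper, not by the data scientist.
    df.write.mode("overwrite").saveAsTable(lake_table)

copy_dwh_table("ADS.DS_PAYMENT", "lake_ads.ds_payment")

Centralizing the JDBC options in one wrapper also makes it easier to stay careful about the data types defined at the table level on each platform.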
8. Comtrade System Integration
Speed up writing SQL queries
→ADS → [GENERATE SQL QUERY] → Training/Scoring table
→Query automation for the training table
→Input for the Python script and an excerpt of the Python script:
SCHEMA | SOURCE     | VAR_IN                                                    | VAR_OUT               | FUNCTIONS          | PERIODS | ZERO EXCLUDE
ADS    | DS_PAYMENT | TOTAL_PAYMENT_AMT                                         | TOTAL_PAYMENT_AMT     | [MAX, AVG/P]       | [3, 6]  | 1
ADS    | DS_PAYMENT | TOTAL_PAYMENT_CNT                                         | TOTAL_PAYMENT_CNT     | [SUM]              | [1]     | 1
ADS    | DS_PAYMENT | MAX_PAYMENT_AMT                                           | MAX_PAYMENT_AMT       | [MAX]              | [3]     | 1
ADS    | DS_PAYMENT | MIN_PAYMENT_AMT                                           | MIN_PAYMENT_AMT       | [MIN]              | [3]     | 1
ADS    | DS_PAYMENT | ADD_PAYMENT_CNT                                           | ADD_PAYMENT_CNT       | [AVG/P]            | [6]     | 1
ADS    | DS_USAGE   | USAGE_OUT_DUR                                             | USAGE_OUT_DUR         | [SUM]              | [1]     | 1
ADS    | DS_USAGE   | USAGE_OUT_DUR                                             | USAGE_OUT_DUR         | [AVG/P, MAX, MIN]  | [3, 6]  | 1
ADS    | DS_USAGE   | USAGE_OUT_IN_PACK_DUR                                     | USAGE_OUT_IN_PACK_DUR | [SUM]              | [1]     | 1
ADS    | DS_USAGE   | NVL(USAGE_OUT_REG_INT_DUR, 0) + NVL(USAGE_OUT_INT_DUR, 0) | USAGE_OUT_INT_DUR     | [AVG/P]            | [6]     | 1
for i, line in enumerate(variables):
    for i2, k in enumerate(line[2]):        # function
        for i3, kk in enumerate(line[3]):   # period
            if (i == len(variables) - 1) & (i2 == len(line[2]) - 1) & (i3 == len(line[3]) - 1):
                zarez = ''                  # zarez = comma: omitted after the last column
            else:
                zarez = ','
            # Creates the aggregation column, e.g. AVG(FIELD_NAME) AS NEW_FIELD_NAME
            divider = ''
            if 'AVG/P' == str.upper(k):
                func1 = 'SUM'
                func2 = '_' + 'AVG'
                divider = '/' + str(kk)     # average per period: SUM(...) / number of months
            elif ('SUM' == str.upper(k)) & (kk == '1'):
                func1 = 'SUM'
                func2 = ''
            else:
                func1 = k
                func2 = '_' + k
            query += (func1 + '(' + line[1] + '_' + str(kk) + 'M' + ')' + divider +
                      ' AS ' + line[1] + func2 + '_' + str(kk) + 'M' + zarez + '\n')
…
for i, line in enumerate(variables):
    for i2, line2 in enumerate(line[3]):
        if (i == len(variables) - 1) & (i2 == len(line[3]) - 1):
            zarez = ''
        else:
            zarez = ','
        if line[4] == 1:
            # ZERO EXCLUDE flag: ignore zero values of the input variable
            zero_rule = 'AND {varijabla} <> 0'.format(varijabla=line[0])
        else:
            zero_rule = ''
        # Creates the monthly CASE WHEN column, e.g. ... AS FIELD_NAME_3M
        query += ("CASE WHEN TIME_ID BETWEEN ADD_MONTHS('{datum_place}', {vreme2}) AND "
                  "'{datum_place}' {zero_rule} THEN {varijabla} ELSE NULL END AS "
                  "{varijabla2}_{vreme}M{zarez_place}".format(
                      varijabla=line[0], varijabla2=line[1], datum_place=datum,
                      vreme2=-1 * (int(line2) - 1), zero_rule=zero_rule,
                      vreme=line2, zarez_place=zarez)) + '\n'
query += "FROM\n"   # FROM clause continues; truncated on the slide
9. Comtrade System Integration
Develop phase - Devote more time to the creative side
→Improve traditional ML development processes:
• Benefit from pre-trained models (deep learning – mainly image recognition)
• Automated Machine Learning (AutoML) – pretty good for supervised ML
→AutoML:
• Optimizes DS workload or compensates for lack of experience
• Handles tasks like Feature Selection, Data Preprocessing, Hyperparameter Optimization, and Model/Algorithm Selection
• Lets you focus more on the data side
• Is no silver bullet; it is more an exploration tool than an optimal model generation tool (see the sketch after the tool list below)
MLBox, Auto-Sklearn, TPOT, H2O AutoML, Auto Keras, Auto PyTorch, Google Cloud AutoML, DataRobot, etc.
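The deck does not prescribe a specific tool; as one illustration only, a minimal TPOT run on a stand-in training set might look like the sketch below, assuming the tpot and scikit-learn packages are installed (the dataset here is synthetic, not an ADS table).

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from tpot import TPOTClassifier

# Stand-in for an ADS training table; replace with the real features/target.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# AutoML searches over preprocessing, feature selection, model choice, and
# hyperparameters; treat the result as an exploration baseline, not a final model.
tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=42, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best found pipeline as plain scikit-learn code for review.
tpot.export("best_pipeline.py")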
10. Comtrade System Integration
Deploy phase - you don't get any value out of a model sitting on someone's computer
→The phase where the model is transferred to a production environment.
→The same best-practice principles and design patterns for software also apply to ML models
→The ML model should be deployed as part of the existing data pipeline
→The output of the ML model should be monitored for bias
→ML model in deploy phase:
• Registered in appropriate repository
• Passed testing
• Model artifacts are retained
→Validate model → Publish model → Deliver model
→Don't update Python libraries before proper testing in the development environment 😊
11. Comtrade System Integration
Deploy phase – more than one ML model
→Model registry:
• Place for all trained/production-ready models (with version control)
• Alternative models as backup
• All model artifacts, model dependencies, evaluation metrics, documentation
• Which dataset was used for training / model lineage
• Log performance details of the model and comparison with other models
• Tracking models the whole time (training, staging, and production)
→A model registry enables faster deployment of your models or retraining of current ones (see the sketch at the end of this slide)
→Shared by multiple team members (team collaboration)
→Tie business rules to the output of the production model
→Consume the model through API integration
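The deck does not name a registry product; purely as an illustration, and assuming an MLflow tracking server is available, registering a model version together with its metrics and artifacts could look like this (the server URL, experiment name, and model name are placeholders).

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Stand-in training data and model; replace with the real pipeline.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # placeholder server
mlflow.set_experiment("churn_models")                    # placeholder experiment

with mlflow.start_run():
    # Keep parameters, metrics, artifacts, and dependencies with the model version.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="churn_rf",   # creates a new version in the registry
    )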
DANIJEL
During deployment in large organizations we have to orchestrate more than one ML model, so the best thing is to keep in mind from the very beginning of the project that we will have more ML models in the future, and to organize everything in a manner that supports adding new models easily.
…
- From the very beginning, special focus in the Data Science Lifecycle should be on data quality and production.
Foundation for more models in the future:
Development of the analytical dataset for future model development can be treated as a separate project.
JELENA
So if we go into more detail…
When developing a model, the focus is on data preparation
– Organize DB tables considering performance and optimization
Analyze the addition of columns and of important sources
Design the sources and target tables, decide how to organize the tables in terms of performance and logic, and keep a high-level view of what the use cases are.
JELENA
TO MENTION:
Organize DB tables considering performance and optimization
Feature Engineering isn't about generating a higher quantity of new features; it's about the quality of the features created.
- DANIJEL OR JELENA
Domain knowledge cannot be optimized.
- Make an instruction file with field names and the action for handling nulls (a small sketch follows below):
Constant value, Max(), Min(), Mean(), Nearby value, Regression, Delete record
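A small pandas sketch of such an instruction file, with hypothetical field names and strategies; the deck does not prescribe a format.

import pandas as pd

# Instruction "file": field name -> null-handling action (illustrative mapping).
null_rules = {
    "total_payment_amt": "mean",
    "max_payment_amt":   "max",
    "usage_out_dur":     "constant:0",
    "contract_end_date": "delete_record",
}

def apply_null_rules(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Apply the per-field null-handling actions defined in the instruction file."""
    for col, rule in rules.items():
        if rule == "mean":
            df[col] = df[col].fillna(df[col].mean())
        elif rule == "max":
            df[col] = df[col].fillna(df[col].max())
        elif rule == "min":
            df[col] = df[col].fillna(df[col].min())
        elif rule.startswith("constant:"):
            df[col] = df[col].fillna(float(rule.split(":")[1]))
        elif rule == "delete_record":
            df = df[df[col].notna()]
    return df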
- Domain knowledge will allow you to take the impact of your machine learning skills to a much higher level of significance.
--------------
- Random forests, for example, can handle heterogeneous data types right out of the box.
As a data scientist with domain knowledge, you will have the answer to the question "Which data points add value?", and you just need to find them.
DANIJEL
MILOS
Benefit / suggestion:
Parallel execution
No temp data on initial database
Fast transfer
Be careful about the data types specified at the table level
DANIJEL
JELENA (to the end)
Efficiently automate all regular, manual, and tedious workloads of ML implementations
"Falls short" for Feature Engineering.
Can easily overfit (watch the label distribution, the number of outliers, etc.).