Text Classification Powered by Apache Mahout and Lucene (lucenerevolution)
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin
Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use faceting to quickly get an understanding of the fields in your documents. It walks you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, including a few anecdotes on drafting domain-specific features.
Configure
Machine learning techniques are powerful, but building and deploying such models for production use requires a lot of care and expertise.
Many books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques to use in a production environment is usually forgotten and underestimated. The aim of this talk is to shed some light on current machine learning deployment practices and go into detail on how to deploy sustainable machine learning pipelines.
Reproducible AI using MLflow and PyTorch (Databricks)
Model reproducibility is becoming the next frontier for successful AI model building and deployment, in both research and production scenarios. In this talk, we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability, and speed up collaboration for AI projects.
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B... (Databricks)
We all know what they say – the bigger the data, the better. But when the data gets really big, how do you mine it, and which deep learning framework should you use? This talk will survey, from a developer's perspective, three of the most popular deep learning frameworks—TensorFlow, Keras, and PyTorch—as well as when to use their distributed implementations.
We’ll compare code samples from each framework and discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data) as well as help you answer questions such as:
As a developer how do I pick the right deep learning framework?
Do I want to develop my own model or should I employ an existing one?
How do I strike a trade-off between productivity and control through low-level APIs?
What language should I choose?
In this session, we will explore how to build a deep learning application with TensorFlow, Keras, or PyTorch in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you.
Product recommendation system for an e-commerce site, by Koby KARP, Data Scientist (Equancy) & Hervé MIGNOT, Partner at Equancy
Recommendation remains a key tool for personalizing e-commerce sites, and the subject is far from exhausted. Taking a market's particularities into account may require adapting the processing and the algorithms used. After a review of recommendation techniques, we will present the specific approach we adopted. The system was developed on Spark for data preparation and for computing the recommendation models. A simple API and its service were developed to deliver the recommendations to client applications.
No more struggles with Apache Spark workloads in production (Chetan Khatri)
Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production.
Apache Spark
Primary data structures (RDD, Dataset, DataFrame)
Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
Parallel read from JDBC: Challenges and best practices.
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Avoid unnecessary shuffle
Alternative to spark default sort
Why dropDuplicates() doesn't guarantee consistent results, and what the alternative is
Optimize Spark stage generation plan
Predicate pushdown with partitioning and bucketing
Why not to use Scala Concurrent ‘Future’ explicitly!
A brief introduction to DTrace technologies within OpenSolaris/Solaris 10, and how DTrace probes within Apache, PHP and MySQL can provide end-to-end dynamic tracing of your Drupal-based web site.
Streaming Inference with Apache Beam and TFX (Databricks)
In this session we will use an LSTM encoder-decoder anomaly detection model as an example, to show the building and retraining of a model that uses the tfx-bsl package to run continuous inference. We will also emphasize the importance of the hermetic seal between the training and inference paths.
AI&BigData Lab 2016. Petr Rudenko: Specifics of training, tuning, and usa... (GeeksLab Odessa)
4.6.16 AI&BigData Lab
Upcoming events: goo.gl/I2gJ4H
In this talk I will try to cover the specifics of using machine learning models in production: what issues arise when training distributed models, how to understand and predict a model's behavior as the amount of data grows, and how to tune and select accurate models given limited cluster resources.
When We Spark and When We Don’t: Developing Data and ML Pipelines (Stitch Fix Algorithms)
The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities to power nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines are often composed of a mix of task types and containers.
This talk will cover thoughts and guidelines on how we develop, schedule, and maintain these pipelines at Stitch Fix. We’ll discuss guidelines on how we think about which portions of the pipelines we develop to run on what platforms (e.g. what is important to run distributed across Spark clusters vs run in stand-alone containers) and how we get them to play well together. We’ll also provide an overview of tools and abstractions that have been developed at Stitch Fix to facilitate the process from development, to deployment, to monitoring them in production.
Assessing Graph Solutions for Apache Spark (Databricks)
Users have several options for running graph algorithms with Apache Spark. To support a graph data architecture on top of its linear-oriented DataFrames, the Spark platform offers GraphFrames. However, because GraphFrames are immutable and not a native graph, there are cases where they might not offer the features or performance needed for certain use cases. Another option is to connect Spark to a real-time, scalable and distributed native graph database such as TigerGraph.
In this session, we compare three options — GraphX, Cypher for Apache Spark, and TigerGraph — for different types of workload requirements and data sizes, to help users select the right solution for their needs. We also look at the data transfer and loading time for TigerGraph.
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...) (Amazon Web Services)
The Apache MXNet deep learning framework is used for developing, training, and deploying diverse AI applications, including computer vision, speech recognition, and natural language processing at scale. In this session, learn how to get started with MXNet on the Amazon SageMaker machine learning platform. Hear from Workday about how they built computer vision and natural language processing (NLP) models using MXNet to automatically extract information from paper documents, such as expense receipts, and populate data records. Workday also shares its experience using Sockeye, an MXNet toolkit for quickly prototyping sequence-to-sequence NLP models.
Mobility insights at Swisscom - Understanding collective mobility in Switzerland (François Garillot)
Swisscom is the leading mobile-service provider in Switzerland, with a market share high enough to enable us to model and understand the collective mobility in every area of the country. To accomplish that, we built an urban planning tool that helps cities better manage their infrastructure based on data-based insights, produced with Apache Spark, YARN, Kafka and a good dose of machine learning. In this talk, we will explain how building such a tool involves mining a massive amount of raw data (1.5E9 records/day) to extract fine-grained mobility features from raw network traces. These features are obtained using different machine learning algorithms. For example, we built an algorithm that segments a trajectory into mobile and static periods and trained classifiers that enable us to distinguish between different means of transport. As we sketch the different algorithmic components, we will present our approach to continuously run and test them, which involves complex pipelines managed with Oozie and fuelled with ground truth data. Finally, we will delve into the streaming part of our analytics and see how network events allow Swisscom to understand the characteristics of the flow of people on roads and paths of interest. This requires making a link between network coverage information and geographical positioning in the space of milliseconds and using Spark streaming with libraries that were originally designed for batch processing. We will conclude on the advantages and pitfalls of Spark involved in running this kind of pipeline on a multi-tenant cluster. Audiences should come back from this talk with an overall picture of the use of Apache Spark and related components of its ecosystem in the field of trajectory mining.
Build Large-Scale Data Analytics and AI Pipeline Using RayDP (Databricks)
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue together multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start an Apache Spark job on Ray in your Python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka... (Chetan Khatri)
What is Data Science?
What is Machine Learning, Deep Learning, and AI?
Motivation
Philosophy of Artificial Intelligence (AI)
Role of AI in Daily life
Use cases/Applications
Tools & Technologies
Challenges: Bias, Fake Content, Digital Psychography, Security
Detect Fake Content with “AI”
Learning Path
Career Path
Demystify Information Security & Threats for Data-Driven Platforms With Cheta... (Chetan Khatri)
Pragmatic presentation on Penetration testing for Data-Driven Platforms.
Agenda:
- Motivation
- Information Security - Ethics.
- Encryption
- Authentication
- Information Security & Potential threats with Open Source World.
- Find vulnerabilities.
- Checklist before using any Open Source library.
- Vulnerabilities report.
- Penetration Testing for Data-Driven Developments.
More Related Content
Similar to TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production (Chetan Khatri)
Scala Toronto July 2019 event at 500px.
Pure Functional API Integration
Apache Spark Internals tuning
Performance tuning
Query execution plan optimisation
Cats Effect for switching the execution model at runtime.
Discovery / experience with Monix and Scala Futures.
No more struggles with Apache Spark (PySpark) workloads in production, Chetan Khatri, Data Science Practice Leader, Accion Labs India. PyconLT’19, May 26 - Vilnius, Lithuania.
Fossasia AI-ML technologies and applications for product development (Chetan Khatri)
Training on GPUs and inference on mobile: Artificial Intelligence / Machine Learning technologies and applications for AI-driven product development. Talk at FOSSASIA 2018, Singapore.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first-ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
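As a rough illustration of the first trick (skipping converged vertices), here is a self-contained Scala sketch; the adjacency-list representation and parameters are hypothetical, every vertex is assumed to have out-links, and it implements none of the other optimizations described above.

// Minimal PageRank sketch that skips re-computation for converged vertices.
// Hypothetical illustration only; not code from the STICD paper.
def pagerank(outLinks: Map[Int, Seq[Int]], damping: Double = 0.85,
             tol: Double = 1e-6, maxIter: Int = 100): Map[Int, Double] = {
  val n = outLinks.size
  // Invert the adjacency list to find each vertex's in-neighbours
  val inLinks: Map[Int, Seq[Int]] = outLinks.toSeq
    .flatMap { case (u, vs) => vs.map(v => v -> u) }
    .groupBy(_._1)
    .map { case (v, pairs) => v -> pairs.map(_._2) }
  var rank = outLinks.keys.map(v => v -> 1.0 / n).toMap
  var converged = Set.empty[Int]
  for (_ <- 1 to maxIter) {
    rank = rank.map { case (v, r) =>
      if (converged(v)) v -> r // skip work for already-converged vertices
      else {
        val incoming = inLinks.getOrElse(v, Seq.empty)
          .map(u => rank(u) / outLinks(u).size).sum
        val next = (1 - damping) / n + damping * incoming
        if (math.abs(next - r) < tol) converged += v
        v -> next
      }
    }
  }
  rank
}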
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.
1. TransmogrifAI
Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.
By: Chetan Khatri (@khatri_chetan)
Scala.IO Conference, École Supérieure de Chimie Physique Électronique de Lyon, France
2. About me
Lead - Data Science @ Accion Labs India Pvt. Ltd.
Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors.
Co-authored University Curriculum @ University of Kachchh, India.
Data Engineering @ Nazara Games, Eccella Corporation.
Advisor - Data Science Lab, University of Kachchh, India.
M.Sc. - Computer Science from University of Kachchh, India.
3. Agenda
● What is TransmogrifAI?
● Why you need TransmogrifAI?
● Automation of the machine learning life cycle - from development to deployment.
○ Feature Inference
○ Transformation
○ Automated Feature Validation
○ Automated Model Selection
○ Hyperparameter Optimization
● Type Safety in Spark, TransmogrifAI.
● Example: Code - Titanic Kaggle problem.
4. What is TransmogrifAI?
● TransmogrifAI was open-sourced by Salesforce.com, Inc. in June 2018.
● An end-to-end automated machine learning workflow library for structured data, built on top of Scala and Spark ML.
Built with Scala and Spark.
5. What is TransmogrifAI?
● TransmogrifAI helps extensively to automate the machine learning model life cycle: feature selection, transformation, automated feature validation, automated model selection, and hyperparameter optimization.
● It enforces compile-time type safety, modularity, and reuse.
● Through automation, it achieves accuracies close to hand-tuned models with an almost 100x reduction in time.
6. Why you need TransmogrifAI? Features:
AUTOMATION - Numerous transformers and estimators.
MODULARITY AND REUSE - Enforces a strict separation between ML workflow definitions and data manipulation.
COMPILE-TIME TYPE SAFETY - Workflows built are strongly typed, with code completion during development and fewer runtime errors.
TRANSPARENCY - Model insights leverage stored feature metadata and lineage to help debug models.
7. Why you need TransmogrifAI?
Use TransmogrifAI if you need a machine learning library to:
● Build production ready machine learning applications in hours, not months
● Build machine learning models without getting a Ph.D. in machine learning
● Build modular, reusable, strongly typed machine learning workflows
Read more in the documentation: https://transmogrif.ai/
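To make "hours, not months" concrete, here is a condensed sketch modeled on the Titanic example in the TransmogrifAI documentation; the Passenger fields and passengersData (a Dataset[Passenger]) are assumptions, and exact builder calls vary by version.

import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector

case class Passenger(id: Int, survived: Double, age: Option[Double], sex: Option[String])

// Declare typed features: one response, the rest predictors
val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
val age      = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
val sex      = FeatureBuilder.PickList[Passenger].extract(_.sex.toPickList).asPredictor

// transmogrify() applies automatic, type-appropriate feature engineering
val featureVector = Seq(age, sex).transmogrify()

// Automated model selection and hyperparameter tuning, then train the workflow
val prediction = BinaryClassificationModelSelector.withTrainValidationSplit()
  .setInput(survived, featureVector).getOutput()
val model = new OpWorkflow()
  .setInputDataset(passengersData) // assumed: a Dataset[Passenger]
  .setResultFeatures(prediction)
  .train()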
8. Why is Machine Learning hard?! Really!...
For example, this may be using a linear classifier when your true decision boundaries are non-linear.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
9. Why is Machine Learning hard?! Really!...
Fast and effective debugging is the skill that is most required for implementing modern-day machine learning pipelines.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
10. Real-time Machine Learning takes time to productionize
TransmogrifAI automates the entire ML life cycle to accelerate developer productivity.
11. Under the Hood
Automated Feature Engineering
Automated Feature Selection
Automated Model Selection
12. Automated Feature Engineering
Automatic derivation of new features based on existing features.
[Diagram: raw fields are expanded into derived features that all feed into a single Feature Vector - Email → is-spam flag; Phone → country code; Age → buckets [0-20], [21-30], [>30]; Subject → stop words, top terms (TF-IDF), language detection; Zip Code → average income, house price, school quality, shopping, transportation; DOB → age, day of week, week of year, quarter, month, year, hour; Gender → to binary.]
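In plain Spark, a few of the derivations from the diagram look like the following sketch; raw and its column names (age, dob, phone, gender) are illustrative assumptions.

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark

// Derive some of the diagrammed features from raw columns
val featured = raw
  .withColumn("age_bucket",
    when($"age" <= 20, "0-20").when($"age" <= 30, "21-30").otherwise(">30"))
  .withColumn("dob_day_of_week", dayofweek($"dob")) // 1 = Sunday .. 7 = Saturday
  .withColumn("dob_month", month($"dob"))
  .withColumn("dob_year", year($"dob"))
  .withColumn("phone_country_code", regexp_extract($"phone", "^\\+(\\d+)", 1))
  .withColumn("gender_binary", when($"gender" === "male", 1).otherwise(0))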
13. Automated Feature Engineering
● Analyze every feature column and compute descriptive statistics:
○ Number of nulls, number of empty strings, mean, max, min, standard deviation.
● Handle missing / noisy values:
○ e.g. fill NA with the mean / average / nearby values.
# Pandas / scikit-learn examples (Imputer is the pre-0.20 scikit-learn API):
import numpy
from sklearn.preprocessing import Imputer

# Fill missing values with a sentinel or a constant category
patient_details = patient_details.fillna(-1)
data['City_Type'] = data['City_Type'].fillna('Z')

# Impute missing values with the median of each column
imp = Imputer(missing_values='NaN', strategy='median', axis=0, copy=False)
data_total_imputed = imp.fit_transform(data_total)

# Mark zero values as missing (NaN)
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# Fill missing values with the column means
dataset.fillna(dataset.mean(), inplace=True)
14. Automated Feature Engineering
● Do features have acceptable ranges? Do they contain valid values?
● Could a feature be a leaker?
○ Is it usually filled out after the predicted field is?
○ Is it highly correlated with the predicted field? (see the sketch below)
● Does the feature contain outliers?
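One cheap leakage check is the label correlation the slide mentions, using Spark's built-in Pearson correlation. A sketch; df, the column names, and the 0.95 threshold are illustrative assumptions.

// Flag candidate leaker features by their correlation with the label
val candidates = Seq("amount", "days_since_signup", "support_tickets")
val suspects = candidates.filter(c => math.abs(df.stat.corr(c, "label")) > 0.95)
println(s"Possible leaker features: $suspects")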
15. Automated Feature Selection / Data Pre-processing
● Data types of features drive automatic data pre-processing:
○ MinMaxScaler
○ Normalizer
○ Binarizer
○ Label Encoding
○ One Hot Encoding
● Auto data pre-processing based on the chosen ML model (see the sketch below).
● Algorithms like XGBoost specifically require dummy-encoded data, while algorithms like decision trees don't seem to care at all (sometimes)!
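In Spark ML these steps map onto concrete pipeline stages. A sketch of label encoding, one-hot (dummy) encoding, and scaling; the column names are assumptions, and OneHotEncoderEstimator is the Spark 2.3/2.4-era name (renamed OneHotEncoder in 3.0).

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoderEstimator, VectorAssembler, MinMaxScaler}

// Label-encode the category, then one-hot encode it (dummy variables)
val indexer = new StringIndexer().setInputCol("city").setOutputCol("city_idx")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("city_idx")).setOutputCols(Array("city_vec"))

// Assemble numeric columns and scale them to [0, 1]
val assembler = new VectorAssembler().setInputCols(Array("age", "income")).setOutputCol("nums")
val scaler = new MinMaxScaler().setInputCol("nums").setOutputCol("nums_scaled")

val prepared = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, scaler))
  .fit(df).transform(df)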
16. Auto Data Pre-processing
● Numeric - Imputation, track null values, log transformation for large ranges, scaling, smart binning.
● Categorical - Imputation, track null values, one-hot / dummy encoding, dynamic top-K pivot, smart binning, label encoding, category embedding.
● Text - Tokenization, hash encoding, TF-IDF, Word2Vec, sentiment analysis, language detection (see the sketch below).
● Temporal - Time difference, circular statistics, time extraction (day, week, month, year).
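For the text bucket, a typical Spark ML chain is tokenize → remove stop words → hash → TF-IDF. A sketch; docs and the column names are assumptions.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF, IDF}

val tokenizer = new RegexTokenizer().setInputCol("subject").setOutputCol("tokens")
val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("tf").setNumFeatures(4096)
val idf       = new IDF().setInputCol("tf").setOutputCol("tfidf")

val tfidf = new Pipeline()
  .setStages(Array(tokenizer, remover, hashingTF, idf))
  .fit(docs).transform(docs)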
17. Auto Selection of Best Model with Hyperparameter Tuning
● Machine learning model hyperparameters:
○ Learning rate
○ Epochs
○ Batch size
○ Optimizer
○ Activation function
○ Loss function
● Search algorithms to find the best model and optimal hyperparameters:
○ e.g. grid search, random search, bandit methods (see the sketch below).
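Grid search with cross-validation, as Spark ML exposes it. A sketch; the logistic-regression choice, the grid values, and train are illustrative assumptions.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// Candidate hyperparameter grid (illustrative values)
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val best = cv.fit(train).bestModel // best model found over the grid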
21. Type Safety: Integration with Apache Spark and Scala
● Modular, reusable, strongly typed machine learning workflows on top of Apache Spark.
● Type safety in Apache Spark with the Dataset API.
23. Unification of APIs in Apache Spark 2.0
Before 2.0: DataFrame (untyped API) and Dataset (typed API, 2016) were separate. After unification, DataFrame = Dataset[Row]: DataFrame is simply an alias for Dataset[Row], and the typed API is Dataset[T].
24. Why Dataset?
● Strong typing.
● Ability to use powerful lambda functions.
● Spark SQL's optimized execution engine (Catalyst, Tungsten).
● Can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap, etc.).
● A DataFrame is a Dataset organized into named columns.
● DataFrame is simply a type alias of Dataset[Row].
25. DataFrame API Code
// convert RDD -> DF with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
// filter, groupBy, sum, and then agg()
parsedDF.filter($"project" === "finance").
  groupBy($"sprint").
  agg(sum($"numStories").as("count")).
  limit(100).
  show(100)
Sample data:
project sprint numStories
finance 3 20
finance 4 22
26. DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
  """SELECT sprint, sum(numStories)
  AS count FROM audits WHERE project = 'finance' GROUP BY sprint
  LIMIT 100""")
results.show(100)
Sample data:
project sprint numStories
finance 3 20
finance 4 22
27. Why Structured APIs?
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
  .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
  .map { case (dept, (age, c)) => dept -> age / c }
28. Catalyst in Spark
SQL AST / DataFrame / Datasets → Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → (Cost Model selects) → Selected Physical Plan → RDDs
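You can watch Catalyst walk through these stages with explain(true), which prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan. A minimal sketch, assuming a SparkSession named spark.

import spark.implicits._

val df = spark.range(1000).toDF("id")
// Prints all four plans for this query
df.filter($"id" > 500).select(($"id" * 2).as("doubled")).explain(true)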
29. Dataset API in Spark 2.x
val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects.
case class Employee(name: String, age: Int)
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
val filterDS = employeesDS.filter(p => p.age > 3)
Type-safe: operate on domain objects with compiled lambda functions.
30. Structured APIs in Apache Spark
                  SQL      DataFrames    Datasets
Syntax errors:    Runtime  Compile time  Compile time
Analysis errors:  Runtime  Runtime       Compile time
Analysis errors are caught before a job runs on the cluster.
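Concretely, a typo against a Dataset fails at compile time, while the same typo against a DataFrame only fails when the job runs. A sketch reusing Employee, employeesDF and employeesDS from slide 29 (spark.implicits._ assumed in scope for the map encoder).

// DataFrame: the typo compiles; AnalysisException only at runtime
employeesDF.select("agee")
// Dataset: the same typo is a compile error, caught before the job ever runs
// employeesDS.map(e => e.agee) // error: value agee is not a member of Employee
employeesDS.map(e => e.age + 1) // fields are checked at compile time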
35. A Case Story - Functional Flow - Spark as a SaaS
User Interface: build the workflow - source, target, transformations (filter, aggregation, joins, expressions), machine learning algorithms.
Store the workflow metadata in a document-based NoSQL store, e.g. MongoDB (via ReactiveMongo).
A Scala / Spark job reads the metadata from the NoSQL store (e.g. MongoDB) and runs on the cluster.
Scheduled using Airflow's SparkSubmit Operator.
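The heart of such a metadata-driven job is small: deserialize the stored workflow spec and translate it into DataFrame operations. A simplified, hypothetical sketch; the WorkflowSpec shape, Parquet source/target, and field names are assumptions (the talk's system stores its specs in MongoDB via ReactiveMongo).

import org.apache.spark.sql.SparkSession

// Hypothetical, simplified workflow spec (the real one lives in MongoDB)
case class WorkflowSpec(sourcePath: String, targetPath: String,
                        filterExpr: Option[String],
                        groupBy: Seq[String], agg: Map[String, String])

def run(spark: SparkSession, spec: WorkflowSpec): Unit = {
  var df = spark.read.parquet(spec.sourcePath)
  spec.filterExpr.foreach(expr => df = df.filter(expr)) // e.g. "amount > 100"
  if (spec.groupBy.nonEmpty)
    df = df.groupBy(spec.groupBy.map(df.col): _*).agg(spec.agg) // e.g. Map("amount" -> "sum")
  df.write.mode("overwrite").parquet(spec.targetPath)
}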
36. A Case Story - High Level Technical Architecture - Spark as a SaaS
User Interface → Middleware (Akka HTTP) → Web Services
43. Questions ?
Thank you!
Big Thanks to Scala.IO Organizers and Scala France Community!
@khatri_chetan
chetan.khatri@live.com
44. References
[1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark, from Salesforce Engineering. [online] https://transmogrif.ai
[2] A plugin to Apache Airflow to allow you to run Spark Submit commands as an Operator. [online] https://github.com/rssanders3/airflow-spark-operator-plugin
[3] Apache Spark - Unified Analytics Engine for Big Data. [online] https://spark.apache.org/
[4] Apache Livy. [online] https://livy.incubator.apache.org/
[5] Zayd's Blog - Why is machine learning 'hard'? [online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
[6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions. [online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
[7] Auto-Machine Learning: The Magic Behind Einstein. [online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s