Project "Tourist Factory" UPC
Winners of the Big Data Talent Award 2016!
Twitter + Lambda Architecture (Spark, Kafka, Flume, Cassandra) + Machine Learning.
Tutorial on AI-based Analytics in Traffic Management - Biplav Srivastava
This tutorial on AI-based analytical techniques for traffic management was presented by Biplav Srivastava and Akshat Kumar at the IJCAI 2013 conference in Beijing, China.
It gives an overview of Sentiment Analysis, Natural Language Processing, the phases of sentiment analysis using NLP, a brief idea of Machine Learning, the TextBlob API, and related topics.
Sentiment analysis is the interpretation and classification of emotions (positive, negative, and neutral) within text data using text analysis techniques. It allows businesses to identify customer sentiment toward products, brands, or services in online conversations and feedback.
In this talk, Ashrith will introduce the idea of using machine learning for detecting money laundering. The motivation is that current rules-based engines have limited visibility into money movement; as models learn the nuances of money movement, especially illegal movement, much better money laundering detection becomes possible.
Bio: Ashrith Barthur is a Security Scientist at H2O, currently working on algorithms that detect anomalous behaviour in user activities, network traffic, attacks, financial fraud, and global money movement. He holds a PhD in information security from Purdue University, specializing in anomalous behaviour in the DNS protocol.
https://www.linkedin.com/in/abarthur/
What Is Sentiment Analysis?
Problem Statement
Why Twitter data?
The Process at a Glance
Methodology: How are we doing it?
Pre-processing of the datasets
Extract the candidate or take it as user input.
Calculate sentiment
Visualizing the candidate data
What visualization are we talking about?
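The pipeline outlined above (pre-process, score sentiment, visualize) can be sketched in a few lines. This is a toy lexicon-based scorer, not the TextBlob API the slides cover; the word lists, candidate handle, and example tweets are invented for illustration.

```python
# Toy sentiment pipeline: clean a tweet, then score it against a tiny
# hand-made lexicon. A real project would use a library such as TextBlob.
import re

POSITIVE = {"great", "good", "love", "win", "strong"}
NEGATIVE = {"bad", "terrible", "hate", "lose", "weak"}

def preprocess(tweet: str) -> list[str]:
    """Lowercase, strip URLs/mentions/hashtags, and tokenize."""
    tweet = re.sub(r"https?://\S+|[@#]\w+", "", tweet.lower())
    return re.findall(r"[a-z']+", tweet)

def sentiment(tweet: str) -> float:
    """Polarity in [-1, 1]: (pos - neg) / total matched lexicon words."""
    tokens = preprocess(tweet)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(sentiment("Love the strong speech by @candidate #election"))  # 1.0
print(sentiment("Terrible debate, bad answers"))                    # -1.0
```

Averaging these per-tweet scores over time is one simple way to produce the candidate visualization described above.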
Landuse Classification from Satellite Imagery using Deep Learning - DataWorks Summit
With the abundance of remote sensing satellite imagery, the possibilities are endless as to the kind of insights that can be derived from them. One such use is to determine land use for agriculture and non-agricultural purposes.
In this talk, we’ll be looking at leveraging Sentinel-2 satellite imagery data along with OpenStreetMap labels to be able to classify land use as agricultural or non-agricultural.
Sentinel-2 data has a 10-meter resolution in the RGB bands and is well suited for land use classification. Using these two datasets, many different machine learning tasks can be performed, such as image segmentation into two classes (farmland and non-farmland) or the more challenging task of identifying the crop type being cultivated on fields.
For this talk, we’ll be looking at leveraging convolutional neural networks (CNNs) built with Apache MXNet to train deep learning models for land use classification. We’ll be covering the different deep learning architectures considered for this particular use case along with the appropriate metrics.
We’ll be leveraging streaming pipelines built on Apache Flink and Apache NiFi for model training and inference. Developers will come away with a better understanding of how to analyze satellite imagery for land use, and of the different deep learning architectures along with their pros and cons. SUNEEL MARTHI and CHRIS OLIVIER, Software Development Engineers, Amazon Web Services
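At the heart of the CNN architectures mentioned above is 2-D convolution: sliding a small filter over pixel values. A minimal pure-Python sketch (not the talk's Apache MXNet code), using a hand-picked edge filter and a toy single-band "image" with a field/non-field boundary:

```python
# Valid-mode 2-D convolution (really cross-correlation, as in most DL
# frameworks): slide a kernel over the image and sum elementwise products.
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A vertical edge between "field" (0) and "non-field" (1) pixels.
image = [[0, 0, 1, 1]] * 4
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # vertical-edge detector
print(conv2d(image, sobel_x))  # strong response along the boundary
```

In a real CNN the kernel values are learned from labeled Sentinel-2 tiles rather than hand-picked, and many such filters are stacked with nonlinearities.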
Digital Redefinition of Banking: Banking Transformation - Draup
The increase in the number of digital use cases in the banking and financial services industry has led to the emergence of newer digital hotspots in the US. States such as Minnesota, North Carolina, Texas, and California have a high density of mature talent specializing in these digital use cases. These use cases have also given rise to new hotspots in neighbouring states such as Iowa, Arizona, and Ohio. Bank of America, Wells Fargo, and JP Morgan Chase have capitalized on this rapid digitalization to create solutions in anti-money laundering, digital wealth management, information security, and cloud technology.
Analysing the Digital Maturity of Top US Banks
The digital maturity of banks and financial institutions has been measured by their competency in innovation (including their competitive intensity and growth potential) and by assessing their capabilities in terms of talent scalability and maturity of skills in new-age technologies. By these parameters, firms such as Bank of America, Wells Fargo, Citi, and Capital One have been identified as digital leaders, while Union Bank, First Republic Bank, and HSBC US have been relatively slower in the digital race.
Case-by-Case Analysis of Banking Transformation
Bank of America:
Bank of America has over 14 digital centres, with over 76% of its digital talent based out of centres located in the US. The 4,000+ digital workforce is involved in functions such as app development, analytics, security, and cloud. Bank of America is one of the few leading banks looking to increase the digital capabilities of all its branches through interactive systems that need very little human intervention. Some branches are also fully automated, equipped with an interactive teller machine and a video conferencing room.
Citigroup:
Citi is taking cues from its innovation labs, which are involved in developing cutting-edge solutions such as beacons. The firm’s 3,500+ digital talent pool is predominantly based out of North America. The bank’s smart branches are equipped with interactive media walls that display local weather, stock information, and financial updates. Citi announced a partnership with Nasdaq to create payment systems that use DLT (Distributed Ledger Technology) to record payments.
Wells Fargo:
The firm’s large 7,500+ digital workforce is largely consolidated in the United States, with sporadic distribution in India as well. The firm has 15 digital centres, with only two located outside the US, in Hyderabad and Bengaluru. Over 28% of its digital talent is involved in new-age solutions such as RPA, blockchain, IoT, and AI.
Introduction to seq2seq (Sequence to Sequence) and RNN - Hye-min Ahn
These are my slides introducing the sequence-to-sequence model and the Recurrent Neural Network (RNN) to my laboratory colleagues.
Hyemin Ahn, @CPSLAB, Seoul National University (SNU)
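The RNN these slides introduce updates a hidden state once per input symbol, and a seq2seq encoder folds a whole input sequence into one final state (the "context") for the decoder. A minimal sketch with scalar states and hand-picked, untrained weights (real models use learned weight matrices):

```python
# Vanilla RNN cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).
# Scalar weights here are illustrative placeholders, not trained values.
import math

def rnn_step(x_t: float, h_prev: float, w_x=0.5, w_h=0.8, b=0.0) -> float:
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def encode(sequence):
    """Fold an input sequence into one hidden state (the seq2seq context)."""
    h = 0.0
    for x in sequence:
        h = rnn_step(x, h)
    return h

context = encode([1.0, 0.0, -1.0])
print(round(context, 4))
```

A decoder would run the same kind of cell in reverse, starting from this context and emitting one output symbol per step.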
Semantic Segmentation on Satellite Imagery - RAHUL BHOJWANI
This is an image semantic segmentation project targeting satellite imagery. The goal was to detect the pixel-wise segmentation map for various objects in satellite imagery, including buildings, water bodies, and roads. The data was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented the FCN, U-Net, and SegNet deep learning architectures for this task.
A hands-on introductory workshop on text mining techniques with the R language. The workshop consists of analyzing a set of tweets and trying to extract knowledge from it. We analyze frequent words and the associations between them, look for insights, and determine whether thematic clusters exist.
The workshop is a basic introduction to text mining techniques, which are so useful today for discovering insights in the collections of text that form part of our data ecosystem (social networks, user comments, emails, open text fields in surveys, ...), but which we often fail to take advantage of.
Slides from the talk "Machine learning a lo berserker". A talk that explains machine learning the rough way, with a bit of irresponsibility :P No math, and a bit of practical sense.
More information: http://berserker.science
This presentation will recount the story of how Macys.com (and Bloomingdales.com) selected and migrated from a legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
We'll start with a mercifully brief backgrounder on our website and our business. Then we will go over the various technologies that we considered, as well as our use case-based performance benchmarks that led to the decision to go with Cassandra.
We'll cover the various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
And, finally, we will wrap up with our "lessons learned" and a brief look at our future plans.
A story of how I built an entire system without using a single server to host it, serving everything from the cloud.
#IoT #Serverless #Reactjs #Redux #Nodejs #AWSLambda
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a... - Kevin Mao
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
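The parse-and-enrich step described above can be sketched as follows. The event fields, the asset-owner lookup table, and the in-memory list standing in for a Kafka topic are all hypothetical placeholders for the real NiFi/Kafka/Metron pipeline:

```python
# Toy parse-and-enrich step: raw security events arrive as JSON strings,
# get parsed, and are enriched with an asset-ownership lookup before being
# handed to downstream consumers (search index, long-term store).
import json

ASSET_OWNERS = {"10.0.0.5": "payments-team", "10.0.0.9": "web-team"}

def enrich(raw: str) -> dict:
    event = json.loads(raw)
    event["owner"] = ASSET_OWNERS.get(event.get("src_ip"), "unknown")
    return event

raw_topic = [
    '{"src_ip": "10.0.0.5", "action": "login_failed"}',
    '{"src_ip": "203.0.113.7", "action": "port_scan"}',
]
refined = [enrich(msg) for msg in raw_topic]
for event in refined:
    print(event)
```

In the real system this transformation runs continuously in Storm/Metron topologies, with the refined stream feeding both the search index and the ORC archive in S3.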
Strata San Jose 2017 - Ben Sharma Presentation - Zaloni
Learn about the promise of data lakes:
- Store all types of data in raw format
- Create refined, standardized, trusted datasets for various use cases
- Store data for longer periods of time to enable historical analysis
- Query and access the data using a variety of methods
- Manage streaming and batch data in a converged platform
- Provide shorter time-to-insight with proper data management and governance
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr... - Alexey Zinoviev
Alexey Zinoviev presented this paper at the PiterPy conference: http://it-sobytie.ru/events/3275.
This paper covers the following topics: Data Mining, Machine Learning, Python, SciPy, NumPy, Pandas, NetworkX, Scikit-learn, Octave, R language.
As the complexity of choosing optimised, task-specific steps and ML models is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area, which targets progressive automation of machine learning, AutoML.
Although it focuses on end users without expert knowledge, AutoML also offers new tools to machine learning experts, for example to:
1. Perform architecture search over deep representations
2. Analyse the importance of hyperparameters.
The Data Science Process - Do We Need It and How to Apply It? - Ivo Andreev
Machine learning is not black magic but a discipline that involves statistics, data science, analysis and hard work. From searching patterns and data preparation through applying and optimizing algorithms to obtaining usable predictions, one would need background and appropriate tools.
But do we need it, when there is already available AI as a service solution out there? Do we need to try hard with artificial neural networks? And if we decide to do so, what tools would be a safe bet?
In this session we will go through real world examples, mention key tools from Microsoft and open source world to do data science and machine learning and most importantly - we will provide a workflow and some best practices.
MOPs & ML Pipelines on GCP - Session 6, RGDC - gdgsurrey
MLOps Lifecycle
ML problem framing
ML solution architecture
Data preparation and processing
ML model development
ML pipeline automation and orchestration
ML solution monitoring, optimization, and maintenance
Evolution of Real-time User Engagement Event Consumption at Pinterest - HostedbyConfluent
We will discuss how we at Pinterest transformed real-time user engagement event consumption.
Every day, we log hundreds of billions of user engagement events across different domains to a few common Kafka topics, which are consumed by hundreds of real-time applications. These applications were built upon divergent frameworks (e.g. Spark Streaming, Storm, Flink, and internally developed frameworks using the Kafka Consumer API) without standardized processing logic. This led to repeated processing of similar logic, multiple codebases to maintain, low data quality, and inconsistency with offline datasets. These issues negatively impacted the scalability, reliability, efficiency, and data accuracy of the applications, eventually affecting real-time content recommendation quality and user experience.
To address these challenges, we unified the way our real-time applications consume events by consolidating the compute engines onto Flink, splitting the events in those common topics by engagement type, and generating cleansed events with standardized processing aligned with business concepts. Through these efforts, we achieved multi-million-dollar infrastructure savings and double-digit engagement gains after applications adopted the cleansed events.
Moving forward, we are implementing frameworks for better tracking and governance of the Kafka events and real-time use cases.
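The splitting step described above, fanning one mixed engagement stream out by type so each consumer reads only what it needs, can be sketched like this. Event shapes and type names are hypothetical, and plain Python lists stand in for Kafka topics and Flink jobs:

```python
# Fan a mixed stream of engagement events out into per-type "topics",
# so downstream consumers only read the engagement type they care about.
from collections import defaultdict

def split_by_type(events):
    """Route each event to a per-engagement-type topic (a list here)."""
    topics = defaultdict(list)
    for event in events:
        topics[event["type"]].append(event)
    return topics

mixed_stream = [
    {"type": "click", "pin": 1},
    {"type": "save", "pin": 2},
    {"type": "click", "pin": 3},
]
topics = split_by_type(mixed_stream)
print({name: len(evts) for name, evts in topics.items()})
```

The payoff is that cleansing and standardization logic runs once, at the split, instead of being re-implemented in every consuming application.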
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn - Grokking VN
A tech talk by Khai Tran about the data pipeline LinkedIn uses to collect tens of billions of messages per day, and how they run a real-time processing system to aggregate this data for metrics monitoring.
Some points the talk covers:
- An introduction to LinkedIn's unified metrics platform
- How LinkedIn sets up its big data pipeline using Kafka, HDFS, Apache Calcite, and Apache Samza
- The concept of nearline storage, and how LinkedIn moved from an offline architecture to a nearline architecture
Speaker: Khai Tran, Staff Software Engineer - LinkedIn.
- Currently a staff software engineer at LinkedIn, in charge of the metrics monitoring system. Previously worked at Amazon AWS and Oracle.
- PhD, University of Wisconsin-Madison, with research on database systems.
The concept of the talk is as follows:
- Give a general idea of the user segmentation task in a DMP project and how solving this problem helps our business
- Explain how we use AutoML to solve this task and describe its components
- Share insights about the techniques we apply to make our pipeline fast and stable on huge datasets
User Behavior Hashing for Audience Expansion - Databricks
Learning to hash has been widely adopted as a solution to approximate nearest neighbor search for large-scale data retrieval in many applications. Applying deep architectures to learning to hash has recently gained increasing attention due to its computational efficiency and retrieval quality.
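As a concrete baseline for the idea, random hyperplane LSH (a classic non-learned hashing scheme, not the deep method the talk covers) maps each vector to a short binary code: each bit is the sign of a dot product with a random hyperplane, so nearby vectors tend to land in the same bucket.

```python
# Random hyperplane locality-sensitive hashing: n_bits random hyperplanes
# cut the space, and a vector's code records which side of each it lies on.
import random

def make_hasher(dim, n_bits, seed=42):
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def hash_vec(v):
        return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0)
                     for plane in planes)
    return hash_vec

h = make_hasher(dim=4, n_bits=8)
a = [1.0, 0.2, -0.5, 0.3]
b = [1.1, 0.18, -0.52, 0.28]  # a small perturbation of a
print(h(a))
print(h(b))  # usually identical or nearly so, since a and b are close
```

Learning to hash replaces the random hyperplanes with projections trained so that semantically similar items (here, similar user behavior) collide, improving retrieval quality at the same code length.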
Joker'14: Java as a fundamental working tool of the Data Scientist - Alexey Zinoviev
Alexey Zinoviev presented this paper at the Joker conference: http://jokerconf.com/#zinoviev.
This paper covers the following topics: Data Mining, Machine Learning, Mahout, Spark, MLlib, Python, Octave, R language.
[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora... - Insight Technology, Inc.
Migrating Oracle-based applications to MariaDB has become easier and economically advantageous with the feature set of MariaDB 10.2 and the upcoming 10.3 release. We’ll present details of the features that led DBS Bank to migrate mission-critical applications to MariaDB.
An introduction to generative AI. Slides from a 20-minute talk I gave the other day on generative AI. It is a summary of a longer talk I have on generative AI.
The transition to web3 and the immersive Internet will be gradual, but a fast-paced technological, economic, political, and social transformation is already under way, one that will change the way we live, socialize, play, shop, work, collaborate, and organize ourselves.
This talk offers a skeptic-optimist reflection on how technological convergence (AI + blockchain + smart contracts + 5G + IoT + VR + AR) is building the new era of the Internet, in which the Metaverse and the decentralized web are the protagonists.
The road to the Metaverse takes tangible form in the rise of virtualization. And the possibility of a decentralized web, enabled by blockchain, translates into a paradigm shift, not only technological but also in business and organizational models.
The video of the talk is here: https://www.youtube.com/watch?v=D4lZCiZnROA
The talk reflects on how a set of technologies is converging (blockchain, smart contracts, artificial intelligence, 5G, IoT, virtual reality and augmented reality, ...), and on how this technological convergence is building the new era of the Internet, in which the decentralized web and the Metaverse are the protagonists.
The move from Web 2.0 to Web3 will not be immediate but a long road; even so, the coming years will bring a fast-paced technological and social transformation, with much left to do.
On the one hand, the first steps toward the decentralized web, enabled by blockchain technology, translate into a paradigm shift, not only technological but also in business models.
On the other hand, the road to the Metaverse materializes in the short term in the rise of virtualization of the two worlds we inhabit today: reality (the physical world) and the Internet (the virtual world).
We are on the verge of making everything intelligent. The rise of machine learning, and the fact that we hear about it everywhere, is practical evidence of the resurgence of artificial intelligence, which until now had been hibernating in films and laboratories. The era of intelligence has begun to transform everything, and within a few years it will converge with other technology trends, making this change radical across all social and economic spheres. And this will only be the beginning. Soon everything will move very fast. Our lives are going to change radically thanks to artificial intelligence. #Bilbostack2018
It is important to know how to tell concepts apart. Many people use Data Science and Big Data as if they were interchangeable. Granted, in the end we will use "Big Data" for everything. But we can try to mix them as little as possible.
Big Data and Data Science overlap and are connected, although they are conceptually distinct. Each needs the other.
When we have lots of data and computing power, we want to automate the resulting decisions as much as possible, in as close to real time as possible. This is Big Data + Data Science. We could speak of Big Data Science and forget about the confusion ;)
New consumption habits => Reinventing services, thinking of the user (... - Beatriz Martín @zigiella
A talk-workshop given at FESABID's #JEID17 on how the waves caused by digital technologies change user consumption habits, and how these changes in consumption mean rethinking services. Digital is about rethinking everything.
How important it is to take organizational culture into account in any "intervention" we want to make in a company, and to understand that cultural change is among the most complex and difficult changes in a company.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms like PageRank operate on representations such as Compressed Sparse Row (CSR), an adjacency-list based graph representation.
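A CSR graph can be sketched in a few lines: an offsets array indexed by vertex plus one flat array of edge targets, so a vertex's out-neighbors are a contiguous slice. A minimal pure-Python construction from an edge list (illustrative, not the report's C++/CUDA code):

```python
# Build a Compressed Sparse Row (CSR) representation from an edge list.
def build_csr(num_vertices, edges):
    """edges: list of (src, dst) pairs. Returns (offsets, targets)."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    # offsets[v] is where vertex v's out-edges begin inside targets.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, v):
    """Out-neighbors of v: one contiguous, cache-friendly slice."""
    return targets[offsets[v]:offsets[v + 1]]

offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
print(neighbors(offsets, targets, 0))  # [1, 2]
```

The contiguous layout is what makes CSR attractive for the OpenMP and CUDA kernels benchmarked below: each thread streams through a dense slice instead of chasing pointers.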
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
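One of the optimizations above, skipping computation on vertices that have already converged, can be sketched as frontier-based power iteration: a vertex is recomputed only while some in-neighbor is still moving. A toy Python version, assuming no dangling vertices (this is an illustrative sketch, not the STICD implementation):

```python
# PageRank with per-vertex convergence skipping: keep a frontier of
# vertices whose rank may still change; when a vertex moves by more than
# tol, only its out-neighbors need recomputation next round.
def pagerank(graph, damping=0.85, tol=1e-8, max_iter=100):
    """graph: {vertex: [out-neighbors]}; assumes every vertex has out-links."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    # Pull-style updates need each vertex's in-edges.
    in_edges = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_edges[v].append(u)
    active = set(graph)  # frontier: vertices whose rank may still change
    for _ in range(max_iter):
        if not active:
            break  # all remaining vertices have converged
        new_rank = dict(rank)
        next_active = set()
        for v in active:
            r = (1 - damping) / n + damping * sum(
                rank[u] / len(graph[u]) for u in in_edges[v])
            new_rank[v] = r
            if abs(r - rank[v]) > tol:
                # v moved, so its out-neighbors must be recomputed next round.
                next_active.update(graph[v])
        rank, active = new_rank, next_active
    return rank

g = {0: [1], 1: [2], 2: [0]}  # a 3-cycle: by symmetry all ranks are equal
print(pagerank(g))
```

As the frontier shrinks, each iteration touches fewer vertices, which is exactly the iteration-time saving the paragraph above describes.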
1. Twitter content-based Recommendation System
Barcelona Tourist City Monitor & Insights
01.07.2016
#MACHINELEARNING #SPARK #KAFKA #CASSANDRA
Juan Pablo López
Rodica Fazakas
Yulia Zvyagelskaya
Beatriz Martín
BIG DATA MANAGEMENT AND ANALYTICS
POSTGRADUATE COURSE - FINAL PROJECT
2. The Challenge
Build a content-based recommendation system that delivers real-time personalized
recommendations to social-media users, plus insights visualization for the tourism
and smart-city sectors
The product is addressed to:
● Small and medium-sized companies in the tourism sector, both B2B and B2C
(leisure/travel, tour operators, tourist online portals, retail, HoReCa,
etc.)
● City and neighborhood public departments and administrations
● Event agencies and managers
● Advertising and marketing agencies
4. The Challenge
Aims of the project:
● Twitter data collection and management
● Tourists vs. residents classification
● Topic (user interest) modeling
● Recommendation system implementation
● Real-time streaming statistic calculation
● Predictive model application for streaming
5. The Challenge
Main tasks of the project:
● Design and implement an architecture able to scale to high-volume data
traffic
● Respond to requests in real time
● Use advanced supervised and unsupervised ML techniques
● Extract valuable, relevant information (insights) from the managed data to
deliver tangible business results to customers
● Provide user-friendly visualization and presentation of the extracted
information
8. Data Source
Languages: ENGLISH, FRENCH, RUSSIAN
● Tweets geolocated in Barcelona (bounding box [41.34, 2.03, 41.45, 2.25])
● Tweets with Barcelona keywords: Barcelona, Sagradafa, MWC
9. Data Source (amount of data)
● Geolocated in bounding box [41.34, 2.03, 41.45, 2.25]:
all languages 20,000 tweets/day; only EN, FR, RU 7,000 tweets/day
● Matching Barcelona keywords (Barcelona, Sagradafa, MWC):
all languages 250,000 tweets/day; only EN, FR, RU 80,000 tweets/day
22. Batch Processing: Pre-Process
● Read geolocated tweets stored in HDFS
● Clean tweet text (lowercase; strip numbers, spaces, tabs, etc.)
● Categorize users (tourist, resident) by comparing the geolocation of their
last 200 tweets
● Save in Cassandra for ML processes
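The categorization step above can be sketched as follows. This is a minimal illustration, not the project's Spark job: the Barcelona bounding box comes from the slides, but the 0.8 threshold, the field names (`lat`, `lon`), and the function names are assumptions.

```python
# Hypothetical sketch of tourist/resident categorization from the
# geolocation of a user's last 200 tweets.

BBOX = (41.34, 2.03, 41.45, 2.25)  # (lat_min, lon_min, lat_max, lon_max)

def in_barcelona(lat, lon, bbox=BBOX):
    """True if the point falls inside the Barcelona bounding box."""
    lat_min, lon_min, lat_max, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

def categorize_user(tweets, threshold=0.8, window=200):
    """Label a user 'resident' if most of their last `window` geolocated
    tweets fall inside Barcelona, 'tourist' otherwise."""
    recent = [t for t in tweets[-window:] if t.get("lat") is not None]
    if not recent:
        return "unknown"
    inside = sum(in_barcelona(t["lat"], t["lon"]) for t in recent)
    return "resident" if inside / len(recent) >= threshold else "tourist"
```

In a batch job this function would run per user over the geotagged history read from HDFS, with the resulting label written to Cassandra.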
30. Data Analytics
Tasks:
● Tourist vs. resident detection algorithm for geotagged data
● Tourist vs. resident classification of non-geotagged data with supervised
machine learning
● Topic (user interest) modeling with unsupervised machine learning
● Recommendation system building
● Statistics calculation
● Visualization
31. Text Preprocessing
● remove URLs;
● remove @-mentions from the data;
● remove number characters, e.g. 1 or 3.14 (removeNumbers);
● remove punctuation characters (removePunctuation);
● convert all text to lower case (tolower);
● keep only words with a minimum length of 3 characters;
● remove stop words from the data;
● reduce words to their stems, e.g. 'walk' is the stem of 'walking' and
'walked' (stemming);
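The function names in parentheses above (removeNumbers, removePunctuation, tolower) come from R's tm package, which the slides appear to reference. The same pipeline can be sketched in plain Python; the stop-word list and the crude suffix-stripping stemmer here are placeholder assumptions standing in for a real stop-word list and a Porter-style stemmer.

```python
import re

STOPWORDS = {"the", "and", "for", "with", "this"}  # placeholder subset

def stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)       # remove URLs
    text = re.sub(r"@\w+", " ", text)               # remove @-mentions
    text = re.sub(r"[0-9]+(\.[0-9]+)?", " ", text)  # remove numbers
    text = text.lower()                             # lower case
    text = re.sub(r"[^\w\s]", " ", text)            # remove punctuation
    tokens = [w for w in text.split()
              if len(w) >= 3 and w not in STOPWORDS]  # length & stop words
    return [stem(w) for w in tokens]
```

For example, `preprocess("Walking to @friend http://x.co at 3.14 the Sagrada Familia!")` reduces the tweet to the tokens `["walk", "sagrada", "familia"]`.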
32. SVM: data tourists vs. residents classification
Challenge: since less than 1% of tweets are geotagged, Twitter users have to
be classified into tourists and residents to extract further insights and
topics of interest
Aim: build a predictive model to classify non-geotagged twitter texts to distinguish
tourists from residents.
33. SVM: data tourists vs. residents classification
Dataset: a labeled collection of tweet texts (only from Barcelona) as the
predictor (independent) variable and labels (TRUE for tourist / FALSE for
resident) as the response (dependent) variable
Validation protocol:
● Training set (60% of the original dataset) to build up prediction algorithm
● Cross-validation set (20%) to compare the candidate algorithms and select
the best-performing one
● Test set (20%) to apply best prediction algorithm and get an idea about its
performance on unseen data
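The 60/20/20 validation protocol above can be sketched as a simple shuffled split. A minimal sketch, assuming the dataset fits in memory as a list of (text, label) rows; the seed and function name are illustrative.

```python
import random

def split_dataset(rows, seed=42):
    """Shuffle and split rows into 60% training, 20% cross-validation,
    and 20% test, mirroring the validation protocol on the slide."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * 0.6)
    n_cv = int(n * 0.2)
    return (rows[:n_train],
            rows[n_train:n_train + n_cv],
            rows[n_train + n_cv:])
```

Fixing the seed makes the split reproducible, so the same partitions are used when comparing the candidate algorithms.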
34. SVM: data tourists vs. residents classification
Prototyping
● Naive Bayes
● Logistic Regression (Maxent)
● k-NN
● SVM
35. SVM: data tourists vs. residents classification
Reasons why SVMs perform well for text categorization
SVMs:
● Acknowledge the particular properties of text: high dimensional feature
spaces, few irrelevant features (dense concept vector), and sparse instance
vectors
● Outperform other techniques substantially and significantly
● Eliminate the need for feature selection, making text categorization
considerably easier
● Are robust and do not require much parameter tuning
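To make the sparse-instance-vector point concrete, here is a pure-Python, Pegasos-style subgradient trainer for a linear SVM on bag-of-words features. This is a teaching sketch under stated assumptions, not the project's actual implementation (which would plausibly use Spark MLlib or an R/scikit-learn SVM); labels, hyperparameters, and function names are illustrative.

```python
import random
from collections import defaultdict

def featurize(tokens):
    """Sparse bag-of-words: term -> count."""
    feats = defaultdict(float)
    for t in tokens:
        feats[t] += 1.0
    return feats

def train_svm(data, lam=0.01, epochs=50, seed=0):
    """Pegasos-style linear SVM training. `data` is a list of
    (tokens, label) pairs with label in {-1, +1}
    (+1 = tourist, -1 = resident in our setting)."""
    w = defaultdict(float)
    rng = random.Random(seed)
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for tokens, y in data:
            t += 1
            eta = 1.0 / (lam * t)
            x = featurize(tokens)
            margin = y * sum(w[f] * v for f, v in x.items())
            for f in w:                     # L2 regularization shrinkage
                w[f] *= (1 - eta * lam)
            if margin < 1:                  # hinge-loss subgradient step
                for f, v in x.items():
                    w[f] += eta * y * v
    return w

def predict(w, tokens):
    score = sum(w[f] * v for f, v in featurize(tokens).items())
    return 1 if score >= 0 else -1
```

Note that both training and prediction only ever touch the non-zero features of each tweet, which is exactly why linear SVMs scale well on high-dimensional, sparse text data.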
36. Topic Modeling
We use topic modelling to automatically detect topics of interest to Twitter users
previously detected as tourists.
● Uncover the hidden topical structure in tweets.
● Assign topics to users.
● Use these assignments to make targeted recommendations.
37. Topic Modeling
Dataset
● Geolocalized tweets from Barcelona, aggregated by identified tourist
Algorithm: baseline Latent Dirichlet Allocation (LDA)
● Unsupervised learning technique
● Extracts key topics. Each topic is an ordered list of representative words.
● Describes each doc in the corpus based on allocation to the extracted topics.
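For intuition, LDA can be implemented with a collapsed Gibbs sampler in a few dozen lines. This is a minimal teaching sketch, not the library implementation the project would have used in practice; the hyperparameters and iteration count are illustrative.

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of token lists.
    Returns (top 3 words per topic, per-document topic counts)."""
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]        # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # topic totals
    z = []                                      # topic assignment per token
    for d, doc in enumerate(docs):              # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):                      # resample each token's topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], w2i[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    top_words = [sorted(vocab, key=lambda w: -nkw[t][w2i[w]])[:3]
                 for t in range(n_topics)]
    return top_words, ndk
```

Each topic then comes out as an ordered list of its most probable words, which is exactly the shape of the topic table on the next slide.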
38. Topic Modelling : LDA Topics
Topic 0 Topic 1 Topic 2 Topic 3 Topic 4
direct love primavera humid photo
work peopl sound wind love
lip happi festiv cloud beauti
june life drink temperatur hotel
book birthdai night finish camp
market hope plai summer centr
design girl live sant view
chang game stage block beach
39. Recommendation System
user_id topic word recommendation
6448 sports game Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
7296 festivals festiv Festival el Grec, Sonar
1239 sports plai Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
2980 shopping market Boqueria, La Roca Village, Portal del Angel
3501 nature beach Font Magica, Park Guell, Playa de la Barceloneta
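The final lookup from a user's dominant topic to concrete venues can be as simple as a static mapping. The venue lists below come straight from the table above; the function itself is an illustrative sketch, not the deck's implementation.

```python
# Hypothetical topic -> recommendations lookup, reproducing the slide's table.
RECOMMENDATIONS = {
    "sports": ["Bowling Pedralbes", "Camp Nou", "Museu del FC Barcelona"],
    "festivals": ["Festival el Grec", "Sonar"],
    "shopping": ["Boqueria", "La Roca Village", "Portal del Angel"],
    "nature": ["Font Magica", "Park Guell", "Playa de la Barceloneta"],
}

def recommend(topic):
    """Return the recommended venues for a topic (empty list if unknown)."""
    return RECOMMENDATIONS.get(topic, [])
```

In the streaming path this lookup would be keyed by the topic assigned to each user by the LDA step.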