- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering, how does it differ from density-based clustering, and how can it be used for outlier detection?
- What is so-called soft clustering, how does it differ from hard clustering, and how can it be used for outlier detection?
Data Tactics Data Science Brown Bag (April 2014) (Rich Heimann)
This is a presentation we deliver internally every quarter as part of our Data Science Brown Bag series. It covers different types of soft clustering techniques, all of which the team currently applies depending on the complexity of the data and of customer problems. If you are interested in learning more about working with L-3 Data Tactics, or in joining the L-3 Data Tactics Data Science team, please contact us soon! Thank you.
Neural Information Retrieval: In search of meaningful progress (Bhaskar Mitra)
The emergence of deep learning based methods for search poses several challenges and opportunities not just for modeling, but also for benchmarking and measuring progress in the field. Some of these challenges are new, while others have evolved from existing challenges in IR benchmarking exacerbated by the scale at which deep learning models operate. Evaluation efforts such as the TREC Deep Learning track and the MS MARCO public leaderboard are intended to encourage research and track our progress, addressing big questions in our field. The goal is not simply to identify which run is "best" but to move the field forward by developing robust new techniques that work in many different settings and are adopted in research and practice. This entails a wider conversation in the IR community about what constitutes meaningful progress, how benchmark design can encourage or discourage certain outcomes, and about the validity of our findings. In this talk, I will present a brief overview of what we have learned from our work on MS MARCO and the TREC Deep Learning track, and reflect on the state of the field and the road ahead.
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond (Bhaskar Mitra)
The emergence of deep learning-based methods for information retrieval (IR) poses several challenges and opportunities for benchmarking. Some of these are new, while others have evolved from existing challenges in IR exacerbated by the scale at which deep learning models operate. In this talk, I will present a brief overview of what we have learned from our work on MS MARCO and the TREC Deep Learning track, and reflect on the road ahead.
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track (Bhaskar Mitra)
We benchmark Conformer-Kernel models under the strict blind evaluation setting of the TREC 2020 Deep Learning track. In particular, we study the impact of incorporating: (i) Explicit term matching to complement matching based on learned representations (i.e., the “Duet principle”), (ii) query term independence (i.e., the “QTI assumption”) to scale the model to the full retrieval setting, and (iii) the ORCAS click data as an additional document description field. We find evidence which supports that all three aforementioned strategies can lead to improved retrieval quality.
5 Lessons Learned from Designing Neural Models for Information Retrieval (Bhaskar Mitra)
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field, and research publications in the area have been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles are emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrate the application of neural approaches to IR that go beyond text matching.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
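The "duet" idea above can be illustrated schematically. The sketch below is a toy simplification, not the actual convolutional architecture from the paper: the function names, weight shapes, and the use of a single tanh layer per sub-network are all hypothetical stand-ins for the real sub-networks, which would be trained jointly.

```python
import numpy as np

def local_subnet(interaction_matrix, w):
    # Local model: operates on an exact-match interaction matrix
    # between query terms and document terms (1 where terms match, else 0).
    return float(np.tanh(interaction_matrix.flatten() @ w))

def distributed_subnet(q_emb, d_emb, w):
    # Distributed model: matches learned dense representations
    # of the query and the document.
    return float(np.tanh(np.concatenate([q_emb, d_emb]) @ w))

def duet_score(interaction_matrix, q_emb, d_emb, w_local, w_dist):
    # The two sub-scores are summed into a single relevance score;
    # in the real model both parts are trained end-to-end as one network.
    return (local_subnet(interaction_matrix, w_local)
            + distributed_subnet(q_emb, d_emb, w_dist))
```

The key design choice the abstract describes is that neither signal replaces the other: exact term matches and latent semantic matches are complementary, so the combined score is learned jointly rather than mixed post hoc.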
Adversarial and reinforcement learning-based approaches to information retrieval (Bhaskar Mitra)
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share a couple of our recent works in this area that we presented at SIGIR 2018.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
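The scoring idea behind the DESM can be sketched with toy IN/OUT embedding lookups. This is an illustrative simplification, not the paper's exact formulation: the helper names and the mixing weight `alpha` are assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def desm_score(query_terms, doc_terms, emb_in, emb_out):
    # Average cosine similarity over all query-document word pairs,
    # with query words mapped into the IN (input projection) space
    # and document words into the OUT (output projection) space.
    sims = [cosine(emb_in[q], emb_out[d])
            for q in query_terms for d in doc_terms]
    return sum(sims) / len(sims)

def mixture_score(desm, tf_score, alpha=0.5):
    # Linear mixture of the embedding signal with a word-counting signal,
    # which the abstract reports mitigates false positives when ranking
    # a larger set of candidate documents.
    return alpha * desm + (1 - alpha) * tf_score
```

Retaining both projections is the distinctive choice here: standard word2vec usage discards the output matrix, but the IN-OUT pairing captures "aboutness" relations that the IN-IN pairing does not.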
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube) (Rich Heimann)
Big Social Data: The Spatial Turn in Big Data
By Richard Heimann & Abe Usher
University of Maryland Baltimore County Webinar Description:
Increased access to spatial data and overall improved application of spatial analytical methods offer significant potential for social scientific research. This webinar focuses on substantive social science research perspectives while highlighting the rewards of applying geographic information systems (GIS), Big Data, and spatial analytics in one's own research.
As the hype of Web 2.0 has worn off and collaborative use of the Internet has become a societal norm, we are witnessing an unprecedented explosion in the creation and analysis of geospatial data. Just as major governments are reducing their investments in location intelligence, individuals and non-government organizations are fueling a bonfire of innovation in the world of GIS data.
Traditional spatial analyses grew up in an era of sparse data and very weak computational power. Today, both of those circumstances are reversed, and many of the old solutions are no longer suitable to answer today's questions.
"Big Social Data: The Spatial Turn in Big Data" reflects this change and combines two things which, until recently, engaged quite different groups of researchers and practitioners. Together, they require particular techniques and a sophisticated understanding of the special problems associated with spatial social data. Geographic Data Mining, or Geographic Knowledge Discovery, is not new, but is developing and changing rapidly as both more, and different, data becomes available, and people see new applications. The days of ‘Big Data’ require fresh thinking.
The webinar will highlight connections between spatial concepts and data availability. Newly emerging social media data will be emphasized over traditional social science data, better reflecting some of the more recent developments in Big Data - most notably the socially critical exploration of such data.
The ultimate goal of a recommender system is to suggest interesting and non-obvious items (e.g., products to buy, people to connect with, movies to watch, etc.) to users, based on their preferences.
The advent of the Linked Open Data (LOD) initiative in the Semantic Web gave birth to a variety of open knowledge bases freely accessible on the Web. They provide a valuable source of information that can improve conventional recommender systems, if properly exploited.
Here I present several approaches to recommender systems that leverage Linked Data knowledge bases such as DBpedia. In particular, content-based and hybrid recommendation algorithms will be discussed.
For full details about the presented approaches please refer to the full papers mentioned in this presentation.
Exploring Session Context using Distributed Representations of Queries and Re... (Bhaskar Mitra)
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" → "detroit lions". Likewise, "london" → "things to do in london" and "new york" → "new york tourist attractions" can also be considered similar transitions in intent. The reformulations "movies" → "new movies" and "york" → "new york", however, are clearly different despite the lexical similarities in the two reformulations. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that using features based on both representations together achieves better performance than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
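The offset-vector intuition behind the reformulation vectors can be sketched as follows, assuming query embeddings produced by some trained model such as the CLSM mentioned above (the function names here are illustrative, not from the paper):

```python
import numpy as np

def reformulation_vector(v_from, v_to):
    # A reformulation is represented as the offset between the
    # embeddings of the original and the reformulated query.
    return v_to - v_from

def reformulation_similarity(r1, r2):
    # Semantically similar intent transitions (e.g., city -> city's team)
    # should yield nearby offset vectors, measured here by cosine.
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))
```

Under this view, "san francisco" → "san francisco 49ers" and "detroit" → "detroit lions" produce similar offsets even though the surface queries share no terms, which is exactly what makes the representation useful as session context.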
Spam filtering poses a critical problem in text categorization because the features of text are continuously changing. Spam evolves continuously, making it difficult for the filter to classify evolving and evasive new feature patterns. Since most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper presents a system for the automatic detection and filtering of unsolicited electronic messages. We have developed a content-based classifier which uses two topic models, LSI and PLSA, complemented with a text pattern-matching based natural language approach. By combining these powerful statistical and NLP techniques we obtained a parallel content-based spam filter which performs the filtration in two stages: in the first stage, each model generates its individual predictions, which are combined by a voting mechanism in the second stage.
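The two-stage combination described above might be sketched as follows. This is a simplified illustration under the assumption of a simple majority vote; the actual system's voting rule is not specified in the abstract and may differ.

```python
def vote(predictions):
    # Stage two: combine the individual model predictions by majority vote.
    # Ties are resolved conservatively as "ham" to limit false positives.
    spam_votes = sum(1 for p in predictions if p == "spam")
    return "spam" if spam_votes > len(predictions) / 2 else "ham"

def classify(message, models):
    # Stage one: each model (e.g., an LSI classifier, a PLSA classifier,
    # and a pattern-matching NLP rule set) predicts independently,
    # which is what makes the filter parallelizable.
    return vote([model(message) for model in models])
```

For example, with three component models, two "spam" votes outweigh one "ham" vote, while a one-to-one split on two models defaults to "ham".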
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data has been published in freely accessible datasets connected with each other to form the so-called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is certainly an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data into a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
Towards Automatic Analysis of Online Discussions among Hong Kong Students (CITE)
HU, Xiao (University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_619.htm
---------------------------
Author(s) bear(s) the responsibility in case of any infringement of the Intellectual Property Rights of third parties.
---------------------------
CITE was notified by the author(s) that if the presentation slides contain any personal particulars, records and personal data (as defined in the Personal Data (Privacy) Ordinance) such as names, email addresses, photos of students, etc, the author(s) have/has obtained the corresponding person's consent.
I will try to explain what QA is, how we can obtain answers to questions posed in natural language, and how successful we have been in that domain.
All of my knowledge comes from the three proposed papers and related reading around them.
This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization (Blerina Spahiu)
An increasing number of research and industrial initiatives have focused on publishing Linked Open Data, but little attention has been paid to helping consumers better understand existing data sets. In this paper we discuss how an ontology-driven data abstraction model supports the extraction and representation of summaries of linked data sets. The proposed summarization model is the backbone of the ABSTAT framework, which aims at helping users understand big and complex linked data sets. Our framework is evaluated by showing that it is capable of unveiling information that is not explicitly represented in underspecified ontologies and that is valuable to users, e.g., helping them in the formulation of SPARQL queries.
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
We benchmark Conformer-Kernel models under the strict blind evaluation setting of the TREC 2020 Deep Learning track. In particular, we study the impact of incorporating: (i) Explicit term matching to complement matching based on learned representations (i.e., the “Duet principle”), (ii) query term independence (i.e., the “QTI assumption”) to scale the model to the full retrieval setting, and (iii) the ORCAS click data as an additional document description field. We find evidence which supports that all three aforementioned strategies can lead to improved retrieval quality.
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field and research publications in the area has been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles are emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrates the application of neural approaches to IR that go beyond text matching.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share couple of our recent work in this area that we presented at SIGIR 2018.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)Rich Heimann
Big Social Data: The Spatial Turn in Big Data
By Richard Heimann & Abe Usher
University of Maryland Baltimore County Webinar Description:
The increased access to spatial data and overall improved application of spatial analytical methods present certain potential to social scientific research. This webinar is designed to focus on substantive social science research perspectives while exposing rewards involved in the application of geographic information systems (GIS), Big Data, and spatial analytics in their own research.
What is witnessed as the hype of Web 2.0 has worn off and the collaborative use of the Internet becomes a societal norm is an unprecedented explosion in the creation and analysis of geospatial data. Just as major governments are reducing their investments in location intelligence, individuals and non-government organizations are fueling a bonfire of innovation in the world of GIS data.
Traditional spatial analyses grew up in an era of sparse data and very weak computational power. Today, both of those circumstances are reversed and many of the old solutions are no longer suitable to answer todays questions.
"Big Social Data: The Spatial Turn in Big Data" reflects this change and combines two things which, until recently, engaged quite different groups of researchers and practitioners. Together, they require particular techniques and a sophisticated understanding of the special problems associated with spatial social data. Geographic Data Mining, or Geographic Knowledge Discovery, is not new, but is developing and changing rapidly as both more, and different, data becomes available, and people see new applications. The days of ‘Big Data’ require fresh thinking.
The webinar will highlight connections between spatial concepts and data availability. New emerging social media data will be promoted over traditional social science data, which better reflect some of the more recently developments in Big Data - most notably the socially critical exploration of such data.
The ultimate goal of a recommender system is to suggest interesting and not obvious items (e.g., products to buy, people to connect with, movies to watch, etc.) to users, based on their preferences.
The advent of the Linked Open Data (LOD) initiative in the Semantic Web gave birth to a variety of open knowledge bases freely accessible on the Web. They provide a valuable source of information that can improve conventional recommender systems, if properly exploited.
Here I present several approaches to recommender systems that leverage Linked Data knowledge bases such as DBpedia. In particular, content-based and hybrid recommendation algorithms will be discussed.
For full details about the presented approaches please refer to the full papers mentioned in this presentation.
Exploring Session Context using Distributed Representations of Queries and Re...Bhaskar Mitra
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" →"detroit lions". Likewise, "london"→"things to do in london" and "new york"→"new york tourist attractions" can also be considered similar transitions in intent. The reformulation "movies" → "new movies" and "york" → "new york", however, are clearly different despite the lexical similarities in the two reformulations. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as for query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improves the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that by using features based on both these representations together we achieve a better performance, than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
Spam filtering poses a critical problem in
text categorization as the features of text is
continuously changing. Spam evolves continuously and
makes it difficult for the filter to classify the evolving
and evading new feature patterns. Most practical
applications are based on online user feedback, the
task calls for fast, incremental and robust learning
algorithms. This paper presents a system for
automatically detection and filtering of unsolicited
electronic messages. In this paper, we have developed
a content-based classifier, which uses two topic models
LSI and PLSA complemented with a text patternmatching
based natural language approach. By
combining these powerful statistical and NLP
techniques we obtained a parallel content based Spam
filter, which performs the filtration in two stages. In
the first stage each model generates its individual
predictions, which are combined by a voting
mechanism as the second stage.
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data have been published in freely accessible datasets connected with each other to form the so called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is for sure an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data in a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
Towards Automatic Analysis of Online Discussions among Hong Kong StudentsCITE
HU, Xiao (University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_619.htm
---------------------------
Author(s) bear(s) the responsibility in case of any infringement of the Intellectual Property Rights of third parties.
---------------------------
CITE was notified by the author(s) that if the presentation slides contain any personal particulars, records and personal data (as defined in the Personal Data (Privacy) Ordinance) such as names, email addresses, photos of students, etc, the author(s) have/has obtained the corresponding person's consent.
I will try to explain what QA is, how we can get answers to questions posed in natural language, and how successful we have been in that domain.
I have gained all of my knowledge from the three proposed papers and from what I read around them.
This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization - Blerina Spahiu
An increasing number of research and industrial initiatives have focused on publishing Linked Open Data, but little attention has been paid to helping consumers better understand existing data sets. In this paper we discuss how an ontology-driven data abstraction model supports the extraction and representation of summaries of linked data sets. The proposed summarization model is the backbone of the ABSTAT framework, which aims at helping users understand big and complex linked data sets. Our framework is evaluated by showing that it is capable of unveiling information that is not explicitly represented in underspecified ontologies and that is valuable to users, e.g., helping them in the formulation of SPARQL queries.
NETWORK CENTRALITY IN SUB-NATIONAL AREAS OF INTEREST USING GDELT DATA - DataTactics
A USMA cadet leverages the GDELT Global Knowledge Graph (GKG) to quantify global human society beyond cataloging physical occurrences and the network structure of the global news.
Scikit-Learn is a powerful machine learning library implemented in Python with the numeric and scientific computing powerhouses Numpy, Scipy, and matplotlib for extremely fast analysis of small- to medium-sized data sets. It is open source, commercially usable, and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a Data Scientist's toolkit for machine learning of incoming data sets.
The purpose of this one-day course is to serve as an introduction to Machine Learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms for a variety of machine learning tasks and learn how to implement these tasks with our data using Scikit-Learn and Python. In particular, we will structure our machine learning models as though we were producing a data product, an actionable model that can be used in larger programs or algorithms, rather than as simply a research or investigation methodology.
Presentation given by Dr. Diego Kuonen, CStat PStat CSci, on November 20, 2013, at the "IBM Developer Days 2013" in Zurich, Switzerland.
ABSTRACT
There is no question that big data has hit the business, government and scientific sectors. The demand for skills in data science is unprecedented in sectors where value, competitiveness and efficiency are driven by data. However, there is plenty of misleading hype around the terms big data and data science. This presentation gives a professional statistician's view on these terms and illustrates the connection between data science and statistics.
The presentation is also available at http://www.statoo.com/BigDataDataScience/.
An approach to identify how much a Linked Data dataset is biased, using statistical methods and the links between datasets. 28/11/2014 @EKAW2014, Linköping, Sweden
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear... - Xiaohan Zeng
The advent of social networks has completely changed our daily life. The deluge of data collected on Social Network Services (SNS) and recent developments in complex network theory have enabled many marvelous predictive analyses, which tell us many amazing stories.
Why do we often feel that "the world is so small?" Is the six-degree separation purely imagination or based on mathematical insights? Why are there just a few rockstars who enjoy extreme popularity while most of us stay unknown to the world? When science meets coffee shop knowledge, things are bound to be intriguing.
I will first briefly describe what social networks are, in the mathematical sense. Then I will introduce some ways to extract characteristics of networks, and how these analyses can explain many anecdotes in our life. Finally, I'll show an example of what we can learn from social network analysis, based on data from Groupon.
How Graph Databases used in Police Department? - Samet KILICTAS
This presentation delivers the basics of the graph concept and graph databases to the audience. It clearly explains how graph databases are used, with sample use cases from industry, and how they can be used by police departments. Questions like "When to use a graph DB?" and "Should I solve a problem with a Graph DB?" are answered.
Preserving privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. In this tutorial, we will first present an overview of privacy breaches over the last two decades and the lessons learned, key regulations and laws, and evolution of privacy techniques leading to differential privacy definition / techniques. Then, we will focus on the application of privacy-preserving data mining techniques in practice, by presenting case studies such as Apple's differential privacy deployment for iOS / macOS, Google's RAPPOR, LinkedIn Salary, and Microsoft's differential privacy deployment for collecting Windows telemetry. We will conclude with open problems and challenges for the data mining / machine learning community, based on our experiences in industry.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
1. The Power of Partnership – from Vision to Reality
L-3 Data Tactics:
Data Science Brown Bag
Welcome!
Hard and Soft Clusters and Cyber Data
April 22, 2014
R² = 500; p < .05 (asymptotically approaching perfect)
2. DT Data Science Brown Bag: Outline
• Why a (our 3rd) Data Science Brown Bag? (Rich H.)
• About Us & About You (Rich H.)
• Case Studies in Cyber:
  • What is Clustering, Honeypots and Density-Based Clustering? (Max W.)
  • What is OPTICS Clustering and how is it different from DB Clustering? ...and how can it be used for outlier detection? (David P.)
  • What is so-called soft clustering and how is it different from hard clustering? ...and how can it be used for outlier detection? (Nathan D.)
• On the horizon... (Rich H.)
L-3
3. DT Data Science Brown Bag: Outline
Learning [close] at a pace similar to the pace at which we learn.
Learning and educating from/to DS to PMs, SWEs, and OPs.
DS2PM: Provide insights for RFIs/RFPs.
PM2DS: Atmospherics from our customers.
DS2SWE: Integrating algorithms.
SWE2DS: Accessing data spaces.
DS2OP: How do you consume the outputs of models?
OP2DS: What models are best to present to OPs?
DS: Data Scientists, PM: Program Managers, SWE: Software Engineers, OP: Operators
L-3
4. About Us: DT Data Science Team
The Team: (Geoffrey B., Nathan D., Rich H., David P., Ted P., Shrayes R., Jonathan T., Adam VE., Max W.)
Graduates from top universities, many of whom are EMC Data Science Certified.
Advanced degrees include: mathematics, computer science, astrophysics, electrical engineering, mechanical engineering, statistics, and the social sciences.
Base competencies (horizontals): clustering, association rules, regression, naive Bayes classification, decision trees, time series, text analysis.
Going beyond the base (verticals)...
L-3
5. About Us: DT Data Science Team
Clustering || Regression || Decision Trees || Text Analysis
Association Rules || Naive Bayesian Classifier || Time Series Analysis
Verticals (word cloud): econometrics, spatial econometrics, graph theory, algorithms, astrophysical time-series analysis, path planning algorithms, Bayesian statistics, constrained optimizations, numerical integration techniques, PCA, bagging/boosting, hierarchical models, IRT, space-time, latent class analysis, structural equation modeling, mixture models, SVM, maxent, CART, autoregressive models, ICA, factor analysis, random forest, dimensional reduction, topic models, sentiment analysis, frequency domain patterns, unsupervised by supervised, change-point models, LUBAP, DLISA, DBAC, OPTICS clustering
L-3
7. About Us: DT Data Science Team
No Free Lunch (NFL) theorems: no algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type. Algorithms must therefore be designed for a particular domain or style of problem; there is no such thing as a general-purpose algorithm.
L-3
8. ABOUT YOU:
35 confirmed, 15 WebEx, 21 Data Tactics employees, 13 L-3 NSS employees; Sam Posten was the first to sign up (WebEx) and Aaron Glahe was the first to sign up in person!

# load the required packages
library(twitteR)
library(igraph)

# define Twitter account names
start <- getUser("L3_NSS")
finish <- getUser("DataTactics")

# find all connections independently for each account
dt.friends <- lookupUsers(start$getFriendIDs())
l3.friends <- lookupUsers(finish$getFriendIDs())

# create one large table that relates followers from each account
relations <- merge(data.frame(User = "DataTactics", Follower = names(dt.friends)),
                   data.frame(User = "L3_NSS", Follower = names(l3.friends)),
                   all = TRUE)

# create a network layout showing each account's community and overlap
g.followers <- graph.data.frame(relations, directed = TRUE)

# finally, plot the graph
tkplot(g.followers)
L-3
11. Why Clustering?
Six Pillars of Data Mining:
Clustering has become a workhorse in Big Data and fits into the Six Pillars of Data Mining and our own DS4PM & DS4G framework.
• Anomaly detection: the identification of unusual data records that might be interesting, or data errors that require further investigation.
• Association rule learning: searches for relationships between variables.
• Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
• Classification: the task of generalizing known structure to apply to new data.
• Regression: finds a function which models the data with the least error.
• Summarization: providing a more compact representation of the data set.

Taxonomy of Questions (ref: DS4PM):
• Causal Effects: an approach to the statistical analysis of cause and effect based on the framework of potential outcomes.
• Classification/Clustering: identifying to which of a set of categories observations belong, on the basis of a training set of data, or without labels in the clustering approach.
• Outlier Detection: the identification of events which do not conform to an expected pattern or to other items in a dataset.
• Big Data and Analytics: discovering interesting relations between variables in large databases.
• Measurement Models: statistical models that measure the relationships between observable variables and an unobserved (or "latent") quantity.
• Text Analysis: the process of deriving high-quality information from text.
L-3
12. Today's presenters:
Max Watson: Max's background is in physics and applied mathematics. He completed his undergraduate degree at the University of California, Berkeley, and his PhD at the University of California, Santa Barbara, in 2012. Max specializes in large-scale simulations, signal analysis, and statistical physics; he joined the Data Tactics team in January 2014 and has supported DHS. Max is an EMC Certified Data Scientist.
David Pekarek: David's background is in mechanical engineering; he specializes in mechanical control systems, optimization, and spatio-temporal statistics. David finished his PhD in 2010 at the California Institute of Technology, joined Data Tactics in the fall of 2012, and currently supports DARPA.
Nathan Danneman: Nathan's background is in political science, with specializations in applied statistics and international conflict. He finished his PhD in June 2013 and joined Data Tactics in May of that same year. He recently co-authored Social Media Mining with R, is active in the local Data Science community, and currently supports DARPA. Nathan is an EMC Certified Data Scientist.
L-3
14. Outline
• What are Honeypots?
• Cluster Analysis
  - General Principles
  - Density-Based Clustering
• Cluster Analysis Applied to Honeypot Data
• Conclusions
L-3
15. Honeypots
Honeypots are traps set to detect, deflect, or counteract attempts at unauthorized use of information systems.
8 websites: (USA, 4), (Singapore, 2), (Brazil, 2) [brought to you by Ted Procita]
Collection period: October 15, 2013 - November 18, 2013
2 sources of data: requests at the firewall and requests at the webserver
Number of webserver requests: ~4000
Some information from a typical 'hit' on the webserver:
IP address     Country  Request      Timestamp
101.227.4.25   CN       /robots.txt  10/17/13 17:58:21
L-3
16. Goals of Honeypot Analysis
• Categorize IP addresses in terms of similar requests
• Determine how requests vary by country
• Detect outliers
L-3
18. Distance I: Combine Requests
1) Gather all unique requests invoked by each IP address:
IP1 ⇒ { /, /robots.txt, ... }
IP2 ⇒ { /HNAP1/, /manager/html, ... }
...
L-3
19. Distance II: Jaccard Similarity
Requests from IP address A: {♣, ♦}
Requests from IP address B: {♠, ♦}
Jaccard similarity:
intersection(A, B) = {♦}; union(A, B) = {♣, ♦, ♠}
J(A, B) = 1/3
Effective distance: D = 1 - J, which ranges from 0 to 1
D = 0: A and B issue the same requests
D = 1: A and B issue completely different requests
L-3
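As a sketch of the arithmetic above, the effective distance D = 1 - J can be written as a small Python function; the request sets below are hypothetical examples, not honeypot data:

```python
def jaccard_distance(a, b):
    """Effective distance D = 1 - J(A, B) between two sets of requests."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty request sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)

# hypothetical request sets for two IP addresses
ip_a = {"/", "/robots.txt"}
ip_b = {"/HNAP1/", "/robots.txt"}

# one shared request out of three distinct requests: J = 1/3, D = 2/3
print(jaccard_distance(ip_a, ip_b))
```

With one shared request out of three distinct requests, J = 1/3 and D = 2/3, matching the worked example on the slide.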
20. Distance III: All Pairs
Calculate the effective distance between all pairs of IP addresses, leaving us with a pairwise distance matrix (but usually in a high number of dimensions!)
L-3
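The all-pairs step can be sketched the same way, building the full matrix of effective distances from per-IP request sets (again hypothetical); for thousands of IPs this matrix is the object the clustering step consumes:

```python
# hypothetical request sets per IP address
requests = {
    "IP1": {"/", "/robots.txt"},
    "IP2": {"/", "/robots.txt", "/index.html"},
    "IP3": {"/HNAP1/", "/manager/html"},
}

def jaccard_distance(a, b):
    """Effective distance D = 1 - J(A, B) between two request sets."""
    return 1.0 - len(a & b) / len(a | b)

ips = sorted(requests)
# pairwise effective-distance matrix over all IP pairs
D = [[jaccard_distance(requests[u], requests[v]) for v in ips] for u in ips]
# D is symmetric with a zero diagonal; D[0][2] == 1.0 (no shared requests)
```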
23. Density Based Clustering
2 parameters: distance threshold and minimum number of neighbors (DBSCAN)
Example: minimum number = 2
(Figure: points in dense regions form clusters; isolated points are outliers.)
L-3
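A minimal sketch of DBSCAN over a precomputed distance matrix, using scikit-learn; the distances below are made up, with eps playing the role of the distance threshold and min_samples the minimum neighborhood size (counting the point itself):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical pairwise effective distances among five IPs:
# the first three are near one another; the last two are isolated
D = np.array([
    [0.00, 0.10, 0.20, 0.90, 0.95],
    [0.10, 0.00, 0.15, 0.90, 0.90],
    [0.20, 0.15, 0.00, 0.85, 0.90],
    [0.90, 0.90, 0.85, 0.00, 0.92],
    [0.95, 0.90, 0.90, 0.92, 0.00],
])

# distance threshold 0.3, minimum neighborhood size 2
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)  # the three nearby IPs share a cluster; -1 marks outliers
```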
29. Accomplishments
✓ Categorized behavior of IP addresses based on requests
✓ Detected outliers
✓ Determined how requests vary by country (China vs. USA)
L-3
30. What Clustering Can Do for You
Objects + Attributes: cluster the objects, or cluster the attributes.
Applications:
• IP addresses & their requests • patients & their symptoms
• devices & their malfunctions • people & their associates
L-3
32. Firewall Activity Clustering Workflow
Honey Pot Firewall Activity → Data Preprocessing and Vectorization → OPTICS Clustering → Follow-on Investigations
• Input: ~32K logs (time, host, src IP, location, ports, protocol)
• Vectorization: aggregated counts of each source IP's destination-port hits (~19K IPs × 128 ports)
• Clustering: reachability-distance plot identifying user clusters and outliers (e.g., distinct activity levels on port 53)
• Follow-on: characteristics of outlying IPs and IP clusters
• The majority of source IPs make use of only one destination port
• 94% of source IPs fall into some cluster with similar port usage and traffic volume
L-3
33. OPTICS: Hierarchical Density Based Clustering
• Clustering algorithms provide a means to sort data without pre-existing labels
• Density-based clustering methods are robust in identifying clusters with non-uniform shapes
• The OPTICS algorithm is a density-based approach that simultaneously evaluates cluster results at different scales
(Figure: k-means results vs. density-based clustering results. Is this one cluster or two? The answer depends on scale.)
L-3
34. OPTICS: How does it work?
• The OPTICS algorithm performs two major operations on the data:
  • determining an ordering of all data points, based on the likelihood of points being clustered together
  • assigning each point a Reachability Distance (R.D.): a quantification of the length scale at which the given point will belong to any cluster
• Plotting R.D. vs. the ordered data points, clusters appear as troughs
(Figure: a reachability plot of a face-shaped data set; the whole face, each eye, and the smile appear as nested troughs, with outliers standing apart.)
L-3
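scikit-learn's OPTICS implementation exposes exactly the two outputs described above, an ordering of the points and per-point reachability distances; the toy data here (two dense blobs plus scattered noise) are illustrative only:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points (hypothetical data)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(40, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(40, 2)),
    rng.uniform(low=-3.0, high=8.0, size=(6, 2)),
])

opt = OPTICS(min_samples=5).fit(X)

# reachability distances in cluster order: clusters appear as troughs
reachability = opt.reachability_[opt.ordering_]
labels = opt.labels_  # -1 marks points OPTICS leaves unclustered (outliers)
```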
35. OPTICS: How was it applied?
Data Preprocessing and Vectorization → OPTICS Clustering → Follow-on Investigations
• Source IPs used as the identifier for entities with traffic hitting the honey pot firewall.
• Destination ports used to define the dimensions of feature space. Each of the 127 most common ports (those with at least 60 hits from the total population) got its own dimension; the remaining 'rare' ports were bundled as a single dimension.
• The OPTICS algorithm identified clusters of IPs in 128-dimensional space, with clustering results summarized in a 2-D reachability plot.
• Follow-on investigations performed to identify anomalous properties of outlying IPs and commonalities among clustered IPs.
L-3
37. Firewall Port Usage Clustering Results
(Figure: reachability plot showing clusters with some distinctive activity, and outlying IPs whose activity falls into clusters only at extremely generous length scales.)
L-3
39. Port Usage Cluster Characterization
• IPs with minimal activity on highly travelled ports (22, 53, Other)
• Clusters by single-port activity: 1-15 hits on port 80 (HTTP); 1-10 hits on port 3389 (RDP); 1-14 hits on port 1433 (MSSQL); 1-6 hits on port 445 (Active Directory)
• Small clusters with activity on less-used ports (3306, 5060, 4899, 135, 25, 23, 45091, 48879, 1234)
• Outlying IPs: activity on multiple ports or on very seldom-used ports
L-3
40. Port 53 Traffic Clustering Validation
OPTICS identifies the multimodal distribution of traffic to port 53 (DNS).
L-3
41. Conclusions
• Destination ports show little correlation in the firewall logs. Source IPs tend to cluster by the one port to which they sent traffic.
• OPTICS clustering efficiently sorts source IPs as outliers or as belonging to a cluster of common port usage.
• Interactive plotting tools allow for the rapid characterization of clusters.
L-3
42. Latent Dirichlet Allocation: Characterizing normal behavior and identifying deviations from normality
By Nathan Danneman
L-3
43. Outline
• What is Latent Dirichlet Allocation (LDA)?
• How does it compare to other clustering tools?
• LDA by example: analyzing log files
L-3
44. LDA is a Mixture Model
• Mixture Models:
  - Identify sets of variables that co-occur (behavioral patterns)
  - Determine what behavioral patterns each individual exhibits
• Example: The Sports Equipment Analogy
Golf Clubs Tennis Racket Golf Balls Tennis Balls Baseball Bat
John 12 4
Susan 14 1 6 3
Chris 2 3
Jane 1 11 1
L-3
46. LDA Utilizes Soft Clustering
• Hard clustering: every point is assigned to one group
• Hard clustering with outliers: every point is assigned to one group or none
• Soft clustering: every point is assigned to zero, one, or several groups
L-3
48. Input Data for LDA: Cyber Data
• LDA takes a matrix of counts
• Data: log files from a large network; 8700 users, 85 log types
• Each row represents a user
• Each column represents a log type

         Connection Success  Termination Success  Invalid Login  ...
User 1   0                   3                    2
User 2   12                  3                    0
User 3   3                   0                    18
User 4   2                   22                   1
User 5   7                   5                    9
...
L-3
49. LDA Estimates Two Mixtures
• Output 1: logs that co-occur, forming behavioral patterns
(Diagram: Log Types 1-4 ... each mapped to Behavioral Patterns 1 and 2. Each log relates to zero, one, or many behavioral patterns.)
L-3
50. LDA Estimates Two Mixtures
• Output 1: logs that co-occur, forming behavioral patterns
• Output 2: which behavioral pattern(s) characterize each user
(Diagram: Log Types 1-4 ... map to Behavioral Patterns 1 and 2, which in turn map to Users 1-4 ... Users exhibit zero, one, or many behaviors.)
L-3
51. LDA Workflow
• Build the N (observations) by P (log types) matrix of counts
• Use an empirical method to determine the optimal number of behavioral patterns to estimate
• Estimate the model

        Connection (Successful)  Connection (Failure)  Termination (Successful)  Connection (Time-Out)
User1   15                       15                    0                         3
User2   8                        12                    2                         0
L-3
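The workflow above maps directly onto scikit-learn's LatentDirichletAllocation; the count matrix below is a made-up miniature of the user-by-log-type data (the real matrix was 8700 × 85):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical user-by-log-type count matrix
# columns: Connection (Successful), Connection (Failure),
#          Termination (Successful), Connection (Time-Out)
X = np.array([
    [15, 15,  0,  3],
    [ 8, 12,  2,  0],
    [ 0,  1, 20, 18],
    [ 1,  0, 17, 22],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Output 1: behavioral patterns, as distributions over log types
patterns = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Output 2: soft assignment of each user to the behavioral patterns
user_mix = lda.transform(X)  # each row sums to 1
```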
57. Characterizing Users with Behavioral Patterns
User #2: essentially, entirely firewall connections and terminations.
L-3
58. Characterizing Users with Behavioral Patterns
User #43: lots of failed logins! Normal activity: connections, terminations, logins, logoffs.
L-3
60. Outlier Detection with LDA
• Mixture models make predictions about the proportion of each log type a user will have
• We can compare the predicted proportions to each user's actual proportions to see how well the model captures each user's actions
• Typical users should be well-characterized by mixtures of common behavioral patterns – these are "normal" users
• Users whose actions are not mixtures of common behavioral patterns are doing things that are uncommon – these are outliers
L-3
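The comparison of predicted to actual proportions can be sketched as a reconstruction-error score on top of a fitted model; this particular L1 scoring rule and the count matrix are assumptions for illustration, not necessarily what the team used:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical user-by-log-type counts (as in the workflow sketch)
X = np.array([
    [15, 15,  0,  3],
    [ 8, 12,  2,  0],
    [ 0,  1, 20, 18],
    [30,  0,  0, 30],   # a user whose mix of logs fits no common pattern
])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# predicted log-type proportions: user mixture times pattern distributions
patterns = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
predicted = lda.transform(X) @ patterns

# actual observed proportions per user
actual = X / X.sum(axis=1, keepdims=True)

# L1 reconstruction error: high values flag users the patterns fail to explain
outlier_score = np.abs(predicted - actual).sum(axis=1)
```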
65. LDA Detects Univariate Outliers
One user (User #12) had 77% Windows Hosts Failed Logins; the mean for the data is 0.002%.
(Figure: distribution of the proportion of Windows Hosts: Failed Login logs, with User #12 highlighted.)
L-3
66. LDA Detects Conditional Outliers
User #53 has a typical proportion of Firewall Termination logs...
However, User 53 has more than twice as many Firewall Terminations as users with his/her same proportion of Firewall Connections.
(Figure 1: number of users vs. percentage of logs that are Firewall Terminations; Firewall Terminations comprise about 50% of many users' logs, and User 53 sits within this distribution.)
(Figure 2: the same percentage restricted to users with User 53's proportion of Firewall Connections; here User 53 stands apart.)
L-3
67. Conclusions
• LDA allows an analyst to:
  - Succinctly characterize common behavioral patterns
  - Capture nuance through soft clustering
  - Identify both simple and conditional outliers
• Next Steps:
  - Radically improve parallelized versions of LDA
  - Build enhanced visualizations that allow analysts to interact with data
• Previous Steps:
  - Cyber IR&D II - Honeypots & Topic Graphs
  - https://portal.data-tactics-corp.com/sites/analytics/Shared%20Documents/honeypots.pdf
L-3
68. Final Thoughts...
• Query-based analytics are tenuous for data with large feature spaces and population sizes. For complete answers, we must analyze with comprehensive algorithms.
• Cyber systems regularly lack reliable (or stationary) models and priors. Hence we have been focused on questions of pattern detection (hard) and outlier detection (harder) for big cyber data, primarily obtaining results via clustering analyses.
• There are many, many clustering algorithms, each with distinct features and requirements (No Free Lunch theorems). Choosing the most appropriate tool requires a deep understanding of the available data, the questions at hand, and the pros and cons of applicable methods.
L-3
69. Final Thoughts...
• L-3 Data Tactics has several minimum viable products (MVPs) working on very hard elements of the cyber analytics problem set.
• These MVPs can be used in a support function to existing security-protocol and signature-based systems, or can provide those systems already in place with pattern and anomaly detection.
• Previous and future honeypot collection will further define L-3's cyber competencies in proactive cyber analytics.
L-3
70. ...on the Horizon:
Honeypots and Twitter Collection Platforms
Summer Data Science Internship Program (Robert R. & USMA cadets):
  - Honeypots analytical application development
  - USA Civil Affairs & CERDEC Analytics
  - http://glimmer.rstudio.com/gosystems01/Stability/
Next Data Science Brown Bag in late July.
DS4G & DS4PM both making appearances this year.
Data Science on display at the L-3 Technology Exchange 2014... more to come.
L-3