This deck proposes data journey modelling, a method for predicting the risks and costs of IT developments at an early stage. It maps existing and new journeys of data through an organisation's systems and identifies where data movement between different actors or formats could cause problems. The method was evaluated retrospectively against 18 NHS IT case studies, where 13 of its 19 risk predictions proved accurate. The approach offers a lightweight alternative to more complex modelling techniques for early-stage decision making.
1. Data journey modelling: Predicting risk of IT developments
Iliada Eleftheriou, Suzanne M. Embury and Andy Brass
Principles of Enterprise Modelling, Nov 2016
5. Related work
• Current approaches (COCOMO, PRINCE, UML, etc.):
– are mainly focused on detailed predictions based on substantial models,
– support project managers throughout the development process, rather than giving a low-cost indicator for use in early-stage decision making.
• There is a need for a lightweight approach that gives reliable predictions and can be used early.
6. Project aim
• To develop a method that:
– reliably predicts places of cost and risk,
– can be used in early-stage decision making.
• The data journey model:
– is a lightweight technique,
– captures the journey of data through complex networks of people and systems,
– identifies socio-technical challenges along the journey,
– highlights places of high cost and risk.
7. Methods
• 18 case studies from the NHS domain:
– recent IT developments,
– only 3 successful.
• IT failure factors:
– technical, e.g. conflicting data formats, data silos,
– social: human- and organisation-related factors.
• Data movement: a key indicator of failure.
8. Conceptual model
• Data movement anti-patterns: movements of data that, under some circumstances, impose costs on the new development.
• Example (change of media): if the source stores the data in physical form and the target requests it in electronic form, a transformation cost is implied at either end of the movement: manual data entry, with the injection of errors (see the sketch after this list).
• Administrative costs:
– data sharing agreements,
– governance requirements,
– ethical issues.
• Data islands
• Legacy systems
• Clash of grammar: dates, experience, knowledge.
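As a concrete illustration of the media-change anti-pattern above, here is a minimal sketch in Python; it is not code from the deck, and the Medium enum and function name are my own assumptions:

```python
from enum import Enum

class Medium(Enum):
    PHYSICAL = "physical"      # e.g. paper forms, printed reports
    ELECTRONIC = "electronic"  # e.g. database records, messages

def transformation_cost_implied(source: Medium, target: Medium) -> bool:
    """Media-change anti-pattern: if data rests in one medium but is
    requested in another, someone must transform it (e.g. by manual
    data entry), which costs effort and risks injecting errors."""
    return source != target

# A physical-to-electronic movement implies a transformation cost.
assert transformation_cost_implied(Medium.PHYSICAL, Medium.ELECTRONIC)
```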
11. Operational model
Data journey model:
A. Landscape: the existing journeys of data within an organisational landscape, happening at any given time.
B. New journey: the data journey needed by the new functionality.
A data journey landscape captures both the social and the technical factors that can affect the journey of data.
DATAjourney.org
12. Operational model
• A data journey is a set of data movements between containers.
• A journey leg moves data through media.
• Actors interact with containers.
(These concepts are sketched in code below.)
DATAjourney.org
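To make these definitions concrete, here is a minimal sketch of the operational model as Python dataclasses; this is my own illustration under assumed names, not code from the DATAjourney project:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Actor:
    """A person or role that interacts with containers."""
    name: str

@dataclass(frozen=True)
class Container:
    """A place where data rests: a system, database or paper archive."""
    name: str
    organisation: str
    medium: str  # "physical" or "electronic"

@dataclass(frozen=True)
class JourneyLeg:
    """One movement of data between two containers, through some medium."""
    source: Container
    target: Container
    medium: str  # e.g. "post", "email", "system interface"

@dataclass
class DataJourney:
    """A data journey is a set of data movements (legs) between containers."""
    legs: list[JourneyLeg] = field(default_factory=list)

@dataclass
class Landscape:
    """Existing journeys, plus the actors interacting with the containers."""
    journeys: list[DataJourney] = field(default_factory=list)
    actors: list[Actor] = field(default_factory=list)
```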
13. Predicting risk
• Data movement anti-patterns: high cost and risk occurred when data moved between actors and containers with key discrepancies:
– change of media (physical to electronic),
– discontinuity (crossing into an external organisation),
– actors' properties (clash of grammar).
• We need low-cost ways to incorporate these patterns:
– in some cases, the information is readily available,
– other factors are less obvious (e.g. people's vocabularies),
– use of proxies (see the sketch after this list).
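A sketch of how these three checks might be applied to a single journey leg; the dict representation, the field names and the salary-band threshold are all assumptions made for illustration, not the authors' implementation:

```python
SALARY_BAND_GAP = 3  # assumed threshold for the clash-of-grammar proxy

def flag_risks(leg: dict) -> list[str]:
    """Return the data movement anti-patterns that one journey leg triggers."""
    risks = []
    if leg["source_medium"] != leg["target_medium"]:
        risks.append("change of media")           # e.g. physical -> electronic
    if leg["source_org"] != leg["target_org"]:
        risks.append("discontinuity")             # crosses into another organisation
    if abs(leg["source_band"] - leg["target_band"]) >= SALARY_BAND_GAP:
        risks.append("clash of grammar (proxy)")  # salary band approximates vocabulary
    return risks

leg = {"source_medium": "physical", "target_medium": "electronic",
       "source_org": "GP practice", "target_org": "pathology lab",
       "source_band": 3, "target_band": 8}
print(flag_risks(leg))  # all three anti-patterns fire for this leg
```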
14. Predicting risk
• Group together the elements of the data journey diagram with similar properties.
• Overlay the groupings onto the landscape to form boundaries (a sketch of this step follows).
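The editor's notes below explain that risk is predicted wherever a journey leg crosses from one grouping into another; here is a minimal sketch of that overlay step, with hypothetical container and group names:

```python
def predict_risky_legs(legs, group_of):
    """legs: (source, target) container pairs; group_of: container -> group label.
    A leg whose endpoints fall in different groups crosses a boundary and is
    flagged as a predicted location of cost or risk."""
    return [(src, tgt) for src, tgt in legs if group_of[src] != group_of[tgt]]

group_of = {"GP system": "GP practice",
            "lab LIMS": "pathology lab",
            "agency DB": "external agency"}
legs = [("GP system", "lab LIMS"), ("lab LIMS", "agency DB")]
# Both legs cross an organisational boundary, so both are flagged.
print(predict_risky_legs(legs, group_of))
```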
15. Evaluation
• Retrospective evaluation on a real-world case study.
• Results:
– 13 out of 19 predictions were accurate,
– the model also predicted 7 risks that had not been found by humans but were assessed as feasible by domain experts.
• http://datajourney.org/publications/tech_rep_data_journey.pdf
16. Conclusion
• Contributions:
– a set of 32 IT failure factors,
– data movement anti-patterns,
– the data journey model, which can potentially identify opportunities for cost saving.
• Next: application to another case study
– verify the set of boundaries with the genomics team of St Mary's Hospital.
17. Data journey modelling: Predicting risk of IT developments.
Iliada Eleftheriou
iliada.eleftheriou@manchester.ac.uk
DATAjourney.org
Editor's Notes
I am Iliada, I come from the University of Manchester, and I am now in my fourth and final year.
My project investigates challenges and risks of moving data across contexts.
Today, I will be presenting our paper on how we conceived the data journey model;
a lightweight technique that assists in predicting risk for new IT developments.
Is bigger always better? In the context of cost estimation modelling, of course.
Are bigger, more complex and more detailed cost estimation techniques always preferable?
Often organisations have new requirements coming in, requiring new functionality to be implemented on top of an existing network of people, systems and data.
For example, two departments merge and their data needs to be integrated;
existing data needs to be shared with an external agency to create new value;
or additional data needs to be shared with a consumer.
Managers and stakeholders of these organisations will have to make a quick decision on whether it is worth proceeding with the new development or not.
It might sound like a simple decision, but in real life it is a bit more complicated.
Here we see a drawing from the King's Fund attempting to depict the structure of the National Health Service in the UK. As we can see,
organisations are large and complex, with several sub-organisations and departments, each with its own infrastructure, people, policies, governance and politics.
Experience shows us that integrating new functionality into an already crowded infrastructure causes things to go wrong.
Costs are often underestimated,
Projects are given up
And jobs are lost
So how can we make a go / no go decision in a defensible way, and avoid any newspaper headlines?
Ideally, we would search the literature for an off-the-shelf cost estimation technique.
Current approaches to managing risk and estimating the cost
are mainly focused on creating detailed predictions based on substantial models of the planned development.
They aim to support project managers throughout the development process, rather than giving a low-cost indicator for use in early-stage decision making.
Such approaches, like COCOMO, PRINCE, i* and UML,
are powerful and very useful, but later in the cycle. We might have only a day, a week or at most a month to take the decision.
The aim of our project is to help managers and stakeholders of large complex organisations to make better informed decisions on whether to proceed with a new development or not.
To do so, we developed a method that reliably predicts the risks of new developments and can be used in early-stage decision making.
Following an agile methodology, we came up with a rather simple model: the data journey model, a lightweight technique that captures the journey of data through networks of people and systems.
We analysed 18 case studies from the NHS domain.
They were written by NHS staff and describe recent IT developments.
Surprisingly, only 3 out of the 18 studies were categorised by the authors as having been successful.
The rest were described as having (completely or partly) failed to deliver the expected benefits.
B. We looked for factors influencing the success and failure of the newly introduced development in an existing setting,
and we extracted a set of 32 factors that contributed to the failure of the developments.
We found not just technical issues (e.g. heterogeneous data sources) but also a majority of social, people- and organisation-related factors, like:
resistance to change, lack of shared vision,
governance and ethical issues.
C. A form of data movement, whether between people, systems or organisations, was a key indicator of failure.
Finally, we went through the case studies again and derived generic data movement anti-patterns to serve as early warning signs of failure in a new development.
Data entry is a time-consuming process typically done by clerical staff, who may not have a strong understanding of the meaning of the data they are entering.
Errors can easily be injected that may significantly reduce the quality of the information.
We have found 8 anti-patterns so far. Of course, this is not a complete and final list, but it can get us started. I explain each of them in more detail in the paper.
But we can't simply tell managers to avoid any movement of data.
Hence we propose the data journey model, which, based on these anti-patterns, assists managers in predicting risk.
But let's begin with an example. Let's imagine we go to our local doctor, the GP, to request a blood test.
Example used: a GP requests blood test results from a pathology lab, and a new external agency requires demographics data from the pathology lab to make workload sharing more effective.
So, let's design a data journey model.
As I mentioned before, the data journey model captures the journey, or movement, of data within and across organisations.
Having modelled the existing journeys of the data and the new journey required by the new functionality, we can predict the places in the journey that may impose high costs and risks on the new development.
From data movement anti-patterns, we found that high cost and risk occurred when data moved between actors and containers with some key discrepancies:
Change of media
Discontinuity (external organisation)
Actors' properties (clash of grammar), using salary band as a proxy.
We need low-cost ways of incorporating these factors into the data journey model. In some cases, the information is readily available (like whether a container stores data in physical or electronic form).
However, other factors, like people's vocabularies, are less obvious. For these factors we use a proxy: some piece of information which is cheap to apply and approximates the same relationship between the actors and containers as the original factor. For example, we use salary bands as a proxy indicator for the presence of a "clash of grammars", on the grounds that a large difference in salary bands between actors probably indicates a different degree of technical expertise.
To identify the places in which the above factors may impose costs, we group together the elements of the data journey diagram with similar properties. These groupings are overlaid onto the landscape of the data journey model and form boundaries. The places where a journey leg crosses from one grouping into another are the predicted locations of the cost or risk introduced by the external organisational factor.
As we can see, the model doesn't only predict high-cost places of the new functionality, but also of the existing landscape.
The list of places suggests to managers where further investigation of the potential costs is needed.
We did evaluate our model, though that is part of another paper.
Our methodology can potentially be used to identify opportunities for cost saving in an existing system, as well as predicting costs and risks of new developments.
Also, the methodology may be used to assess organisational readiness for various compliance programmes, such as clinical guidelines for management of chronic conditions, like diabetes.
The guidelines can be modelled as sets of data journeys, to check whether the organisation follows them or not.
If the organisation does not implement a guideline's data journey, the model will show the cost of compliance to the organisation.
For any questions or further clarifications, please don’t hesitate to contact me. My email is: iliada.eleftheriou@manchester.ac.uk