This document proposes a software architecture to address complex and dynamic data modeling challenges. The proposed solution has four main components: [1] An OSGi-based architecture for modularity, reusability and dynamic updates. [2] A graph database (Neo4j) to flexibly store relationships and enable natural queries. [3] A user interface built with AngularJS and D3.js for rich, data-driven visualization. [4] The use of a "mad developer" to implement the architecture. The architecture aims to reduce complexity, support dynamic data and provide a flexible yet user-friendly interface.
A recent presentation on Deeplearning4j's new features, as well as some underused features of the AI framework such as Arbiter, DataVec's transform process, and libnd4j.
Self driving computers active learning workflows with human interpretable ve... – Adam Gibson
Human-in-the-loop learning workflows that leverage deep learning to group and cluster data, plus techniques for accounting for machine learning failures.
One of the main advantages of PHP is that it allows you and your company to build up projects in no time, with immediate feedback and business value. Sometimes, however, fast growth and unforeseen complexity can make your codebase more and more difficult to manage as time passes and new features are added. Domain-Driven Design can be an elegant solution to the problem, but introducing it in mid- to large-sized projects is not always easy: you have to deal with difficulties at the technical, team, and knowledge levels. This talk focuses on how to approach the change in your codebase and in your team's mindset without breaking legacy code or stopping development in favor of never-ending refactoring sessions.
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ... – Srivatsan Ramanujam
Unstructured data is everywhere – in the form of posts, status updates, bloglets, or news feeds on social media, or in the form of customer interactions in Call Center CRM. While many organizations study and monitor social media to track brand value and target specific customer segments, in our experience blending unstructured data with structured data to supplement data science models has been far more effective than working with either independently.
In this talk we will showcase an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models that predict commodity futures from tweets and to enhance churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
Paolo Kreth – Persistence layers for microservices: the converged database a... – matteo mazzeri
This talk will present the difference between polyglot persistence and a converged database approach to mapping data for microservices. A historical point of view will help us understand the difficulties of operating different databases and stores, and the repercussions that operational bottlenecks have on development.
A Hands-on Intro to Data Science and R Presentation.ppt – Sanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
Databases have been around for decades and were highly optimised for data aggregation during that time. Not only has big data changed the database landscape massively in the past years – nowadays we can find many open-source projects among the most popular databases.
After this talk you will be able to decide whether a database can make your work more efficient, and in which direction to look.
Graph Databases in the Microsoft Ecosystem – Marco Parenzan
With SQL Server and Cosmos DB we now have graph databases broadly available, after decades of study in database theory and a niche open-source existence with Neo4j. And then there are services like Microsoft Graph and Azure Digital Twins that give us vertical implementations of graphs. So let's take a tour of graphs in the Microsoft ecosystem.
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF – MLconf
Abstract: How graphs became just another big data primitive
Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. And, a growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today’s graph tools are riddled with limitations and shortcomings, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we’ll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We’ll examine some practical data science workflows to further motivate this argument and we’ll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.
Apache Ignite: In-Memory Hammer for Your Data Science Toolkit – Denis Magda
Machine learning is a method of data analysis that automates the building of analytical models. By using algorithms that iteratively learn from data, computers are able to find hidden insights without the help of explicit programming. These insights bring tremendous benefits into many different domains. For business users, in particular, these insights help organizations improve customer experience, become more competitive, and respond much faster to opportunities or threats.
The availability of very powerful in-memory computing platforms, such as Apache Ignite, means that more organizations can benefit from machine learning today. In this presentation, we will discuss how the Compute Grid, Data Grid, and Machine Learning Grid components of Apache Ignite work together to enable your business to start reaping the benefits of machine learning. Through examples, attendees will learn how Apache Ignite can be used for data analysis and be the in-memory hammer in your machine learning toolkit.
In this video from the ISC Big Data'14 Conference, Ted Willke from Intel presents: The Analytics Frontier of the Hadoop Eco-System.
"The Hadoop MapReduce framework grew out of an effort to make it easy to express and parallelize simple computations that were routinely performed at Google. It wasn’t long before libraries, like Apache Mahout, were developed to enable matrix factorization, clustering, regression, and other more complex analyses on Hadoop. Now, many of these libraries and their workloads are migrating to Apache Spark because it supports a wider class of applications than MapReduce and is more appropriate for iterative algorithms, interactive processing, and streaming applications. What’s next beyond Spark? Where is big data analytics processing headed? How will data scientists program these systems? In this talk, we will explore the current analytics frontier, the popular debates, and discuss some potentially clever additions. We will also share the emergent data science applications and collaborative university research that inform our thinking."
Learn more:
http://www.isc-events.com/bigdata14/schedule.html
and
http://www.intel.com/content/www/us/en/software/intel-graph-solutions.html
Watch the video presentation: https://www.youtube.com/watch?v=qlfx495Ekw0
Solution Use Case Demo: The Power of Relationships in Your Big Data – InfiniteGraph
In this security solution demo, we have integrated Oracle NoSQL DB with InfiniteGraph to demonstrate the power of using the right tools for the solution. By integrating the key value technology of Oracle with the InfiniteGraph distributed graph database, we are able to create new views of existing Call Detail Record (CDR) details to enable discovery of connections, paths and behaviors that may otherwise be missed.
Discover how to add value to your existing Big Data to increase revenues and performance!
Best Practices for Building and Deploying Data Pipelines in Apache Spark – Databricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
11. Architecture – we aim to
• Reduce complexity
• Be reusable
• Be easy to deploy
• Allow dynamic updates
• Be adaptive
• Respond fast
• Keep a low memory profile
• Provide security
• …
16. Architecture – In summary, OSGi goals are …
Service-oriented + modular (bundles)
[Slide diagram: a bundle (x) exposing its services (x', y, x) to the framework]
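To make the bundle + service idea concrete, here is a minimal Java sketch of a bundle publishing a service through the OSGi framework. The GreetingService contract and its single method are hypothetical, invented for illustration; only the org.osgi.framework types are real OSGi API.

    // GreetingService.java – the contract, in a package the bundle exports.
    // (Hypothetical example service; not from the original deck.)
    package example.greeting;

    public interface GreetingService {
        String greet(String name);
    }

    // GreetingActivator.java – runs when the bundle starts and registers
    // the implementation so other bundles can discover and call it.
    package example.greeting.impl;

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;

    import example.greeting.GreetingService;

    public class GreetingActivator implements BundleActivator {
        private ServiceRegistration<GreetingService> registration;

        @Override
        public void start(BundleContext context) {
            GreetingService impl = name -> "Hello, " + name;
            registration = context.registerService(
                    GreetingService.class, impl, null); // null = no service properties
        }

        @Override
        public void stop(BundleContext context) {
            registration.unregister(); // the service disappears cleanly with the bundle
        }
    }

Because registration happens at runtime, bundles can be installed, updated, or removed without restarting the JVM, which is exactly the dynamic-update goal listed above.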
17. Architecture – OSGi : Simple overview
[Slide diagram: one OSGi instance running on the JVM hosts system bundles (Console, Logging, Admin, …), a web server with WAB Applications 1 and 2, and Application Service bundles 1 and 2; some bundles are to be developed by us, the rest are to be installed off the shelf.]
24. Data – Graph Databases – Why?
Flexible data structure
It doesn't matter if the relationships change in the future.
Closer match to the business logic
25. Data – Graph Databases – Why?
Natural query system
You say what you want, not how to get it. Compare a recursive SQL query that walks a chat graph to find each user's cluster:

    with recursive cluster (party, path, depth) as (
        select cast(@userId as character varying),
               cast(@userId as character varying),
               1
        union
        (
            select (case when this.party = amc.userA then amc.userB
                         when this.party = amc.userB then amc.userA end),
                   (this.path || '.' ||
                    (case when this.party = amc.userA then amc.userB
                          when this.party = amc.userB then amc.userA end)),
                   this.depth + 1
            from cluster this, chat amc
            where ((this.party = amc.userA and position(amc.userB in this.path) = 0)
                or (this.party = amc.userB and position(amc.userA in this.path) = 0))
              and this.depth < @depth + 1
        )
    )
    select party, path
    from cluster
    where not exists (
        select *
        from cluster c2
        where cluster.party = c2.party
          and (char_length(cluster.path) > char_length(c2.path)
            or (char_length(cluster.path) = char_length(c2.path)
                and cluster.path > c2.path))
    )
    order by party, path;

SQL: several hours to execute

VS

    START b = node:User(UserId='Manolo')
    MATCH (b)--(friend)--(friendoffriend)
    RETURN count(friendoffriend)

Cypher: 635 ms
26. Data – Graph Databases – Why?
Fits very well with complex data
27. Data – Graph Databases – Why?
Fits very well with bio-informatics
0.9 billion relationships
28. Data – Graph Databases – Why?
Fast prototyping and development
We don't need to spend much time defining a fine-grained schema up front.
29. Data – Graph Databases – What is it?
Properties
Labels
Relationships
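A minimal sketch of these three building blocks, assuming Neo4j's 3.x embedded Java API; the store path, labels, and property names are illustrative, not from the deck:

    import java.io.File;

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class PropertyGraphDemo {
        public static void main(String[] args) {
            GraphDatabaseService db = new GraphDatabaseFactory()
                    .newEmbeddedDatabase(new File("data/demo.db"));
            try (Transaction tx = db.beginTx()) {
                // Labels classify nodes; properties are key/value pairs on them.
                Node manolo = db.createNode(Label.label("User"));
                manolo.setProperty("UserId", "Manolo");
                Node friend = db.createNode(Label.label("User"));
                friend.setProperty("UserId", "Alice");
                // Relationships are typed, directed, and can carry properties too.
                Relationship r = manolo.createRelationshipTo(
                        friend, RelationshipType.withName("FRIEND"));
                r.setProperty("since", 2014);
                tx.success();
            }
            db.shutdown();
        }
    }

Note that no table or schema had to be declared up front: new labels, properties, and relationship types can simply appear as the domain evolves, which is the fast-prototyping point above.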
38. GUI
UI graphs
Model / View / Controller (in the browser, using JavaScript)
JAX-RS (RESTful web services) with JSON responses
Served from an OSGi bundle as a web service
Consumed by the browser client
D3.js (Data-Driven Documents)
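A minimal sketch of the server side, assuming standard JAX-RS annotations; the /graph path, the GraphResource class, and the hand-built JSON payload are illustrative stand-ins for a real query against the graph database:

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    // Deployed inside an OSGi web bundle; the browser-side MVC code
    // fetches this JSON and hands it to D3.js for rendering.
    @Path("/graph")
    public class GraphResource {

        @GET
        @Path("/summary")
        @Produces(MediaType.APPLICATION_JSON)
        public String summary() {
            // A real implementation would run a Cypher query here;
            // a fixed payload keeps the sketch self-contained.
            return "{\"nodes\": 42, \"relationships\": 128}";
        }
    }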
Thanks for attending this presentation; I hope it meets your expectations.
This is only a high-level description of the reasons for choosing the selected architecture and its tools, so do not hesitate to interrupt me if you have any questions. I will be glad to explain anything in more detail if it helps you understand the final solution better.
Data is complex by definition: there are many relationships between different nodes and different domains.
Mapping the "real" world onto a standard entity-relational model is a nightmare, and a source of errors whenever something needs to change in the future (new properties, new objects, new relationships, etc.).
The important thing in any design is to correctly capture at least 99% of the user requirements, but that is impossible when users introduce uncertainty for different reasons (and also when different users come from different domains or points of view).
One of the areas where all software solutions spend the most time is developing the user interface. It needs to be flexible and adaptable to different requirements and uses, or at least the chosen technology should provide the easiest way to create a good user experience.
INTUITIVE
It needs to be flexible and adaptable to different requirements across the different platforms on which it will be used.
Selecting the right technology is also key to a successful project.
Not everything depends on the front end, and likewise, not everything depends on the back end.
This is our solution; we know it is possible to build it using different approaches ... All roads lead to Rome, but some are easier than others.
A good solution is never easy to build... but if it is simple, so much the better.
Some of them are oriented toward web server applications, while others are more service-oriented.
Each module/bundle is a service that publishes some functionality to the others through the OSGi framework in which they all live.
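A minimal sketch of that lookup from the consuming side, reusing the hypothetical GreetingService from the earlier sketch; the OSGi calls themselves (getServiceReference, getService, ungetService) are standard framework API:

    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceReference;

    import example.greeting.GreetingService;

    public class GreetingClient {
        public static String greetFrom(BundleContext context, String name) {
            ServiceReference<GreetingService> ref =
                    context.getServiceReference(GreetingService.class);
            if (ref == null) {
                return "service not available"; // bundles may come and go at runtime
            }
            try {
                return context.getService(ref).greet(name);
            } finally {
                context.ungetService(ref);      // release the reference when done
            }
        }
    }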
It is a complement: this technology appeared several years ago, but in recent years it has been driven by requirements around scalability, clustering, and performance.
Unlike RDBMSs, the implementations sometimes differ from one solution to another, because each one is based on a different paradigm and focuses on a different perspective, depending on how the organization's data is structured.
- Data is organized according to the mindset of the domain experts (e.g., lab people), not the mindset of the IT experts.
Good reference: http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/