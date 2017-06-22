IBM Analytics Platform Group Enterprise Graph Analytics Enterprise large scale graph analytics and computing base on distr...

Enterprise large-scale graph analytics and computing based on a distributed graph database (Titan DB on HBase/Solr), distributed in-memory graph computing (TinkerPop Hadoop-Gremlin SparkGraphComputer), and Hadoop 2

Graph approaches to structuring and analyzing data have been a significant area of interest. Graphs are well suited to expressing complex interconnections and clusters of highly related entities.
Large-scale graph analytics research has grown quickly in recent years, and the Hadoop 2 ecosystem is a good foundation for it. Enterprise graph computing needs to store large graphs and compute over them quickly, which calls for two kinds of systems. The first is an OLTP database system that lets users query the graph in real time. HBase, a distributed NoSQL database, can serve as the backend storage that persists a large graph: the property graph stores its vertices and edges as key-value pairs in HBase, which also keeps the data highly reliable, scalable, and fault tolerant. Solr, as the distributed index, makes queries more efficient, and Titan itself handles caching and transactions. The second is an OLAP analytics system that uses the TinkerPop Hadoop-Gremlin SparkGraphComputer to process a large graph in which every vertex and edge is analyzed; a cluster-computing platform helps process large, distributed, in-memory graph datasets.
A graph database based on HBase/Solr, combined with graph computing and analysis based on Spark, is powerful for discovering valuable information about relationships in large, complex data, which represents a significant business opportunity for enterprises. It supports graph data analytics in a wide range of domains such as social networking, recommendation engines, advertisement optimization, knowledge representation, health care, education, and security.

Published in: Technology

  1. IBM Analytics Platform Group: Enterprise Graph Analytics
     Enterprise large-scale graph analytics and computing based on a distributed graph database (Titan DB on HBase/Solr), distributed in-memory graph computing (TinkerPop Hadoop-Gremlin SparkGraphComputer), and Hadoop 2
     • Jun (Terry) Yang, yangjuncn@cn.ibm.com
     • Jing Chen (Jerry) He, jinghe@us.ibm.com
     • Hadoop Summit 2017, San Jose, USA, June 13-15
  2. Agenda
     • Challenges in hybrid data analytics
     • Enterprise data quality analytics system based on graphed metadata
     • Graph in enterprise data quality analytics solution
     © IBM 2017 Hadoop Summit 2017
  3. Hybrid data analytics and challenges
     Questions raised:
     • How was “total quantity” calculated? Show me the lineage.
     • What are the source-to-target mappings for the DW?
     • Who read the “sales” data in non-working time?
     • How to ensure data quality?
     Roles shown: Data Warehouse Architect, Auditor, Business Person, Data Architect
  4. How to handle the challenges?
     Data Governance
     • Data Lifecycle Management
     • Data Quality Management: Correctness, Consistency, Completeness, Timeliness
     • Metadata …
     • Master Data Management …
  5. What is Metadata?
     • The data used to describe other data
       − Simple Metadata
       − Rich Metadata
     File system metadata
     • inode attributes for file management
     • Filesystem object attributes include metadata such as modify time, access, owner, permission, etc.
     DBMS/DW/NoSQL metadata
     • Schema for data management
     • Ownership information of data
     • Server/database information of data
     How to manage the metadata across platforms, systems, and servers?
  6. Agenda
     • Challenges in hybrid data analytics
     • Enterprise data quality analytics system based on graphed metadata
     • Graph in enterprise data quality analytics solution
  7. Advantage of Graph in Metadata management
     Traditional solution
     • Limited to one server/system
     • Metadata managed within a server/system
     Property Graph based solution
     • Integrate metadata
     • Handle storage pressure
     • Efficient processing and querying
     • Lineage
     • Wide range managed
  8. Property Graph
     Diagram: G = (V, E), a graph of vertices and edges; a vertex and an edge (label1) each carry a label and key-value properties (Key1:value1, Key2:value2)
     • Born for relationship
     • Intuitive modeling
     • Expressive querying
     • Native analysis
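The property graph model on slide 8 can be sketched in a few lines of plain Python. This is an illustrative in-memory structure only, not how Titan stores data; the class names, the `out_neighbors` helper, and the sample vertices are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: int  # id of the vertex the edge leaves
    in_v: int   # id of the vertex the edge enters
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    """G = (V, E): labelled vertices and edges, each with key-value properties."""
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = Vertex(vid, label, props)
        return self.vertices[vid]

    def add_edge(self, label, out_v, in_v, **props):
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(label, out_v, in_v, props)
        self.edges.append(edge)
        return edge

    def out_neighbors(self, vid, label=None):
        """Vertices reached by following outgoing edges, optionally by edge label."""
        return [self.vertices[e.in_v] for e in self.edges
                if e.out_v == vid and (label is None or e.label == label)]


# Hypothetical sample data: a user runs a program at 22:00.
graph = PropertyGraph()
graph.add_vertex(1, "user", name="alice")
graph.add_vertex(2, "program", name="etl_job")
graph.add_edge("run", 1, 2, ts_hour=22)
```

Keeping properties on edges as well as vertices is what lets a later query filter on when a relationship happened, not only on who or what was involved.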
  9. Using Graph Analytics to Find Complex Patterns
     Diagram: 1st degree, 2nd degree, and 3rd degree relationships
     • Graph queries are a natural way to analyze relationship patterns
       − Less complex than SQL
       − Can handle high degrees of relationship with ease
     • Graph schema facilitates visualization and exploration of relationships
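The 1st, 2nd, and 3rd degree relationships on slide 9 correspond to hop counts in a breadth-first search. A minimal sketch in plain Python, assuming an undirected graph given as an adjacency dictionary; the function name and the sample names are made up for illustration.

```python
from collections import deque


def degrees_of_relationship(adjacency, start, max_degree):
    """Return {vertex: degree} for every vertex within max_degree hops of start."""
    degree = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if degree[current] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in adjacency.get(current, ()):
            if neighbor not in degree:
                degree[neighbor] = degree[current] + 1
                queue.append(neighbor)
    del degree[start]  # the start vertex is not related to itself
    return degree


# Hypothetical chain of acquaintances: alice - bob - carol - dave - erin.
adjacency = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol", "erin"],
    "erin": ["dave"],
}
reachable = degrees_of_relationship(adjacency, "alice", 3)
```

In SQL the same question needs one self-join per degree, which is the sense in which the slide calls graph queries less complex.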
  10. Case study: Audit data access
      • Data theft risk in enterprise hybrid data environments
        – Most data is stolen by insiders.
        – Most data theft happens in non-working time.
        – Over-granting of privileges may cause data theft.
  11. Enterprise data quality analytics system based on graphed metadata
      • Data ingest: finance data, Consumption data, Credit data, Behavioral data, Graphed metadata, …
      • Feature Selection: Statistical learning, Data analysis, (Graphed) Metadata analysis, …
      • Advanced Feature Selection: Gradient Boosting Decision Tree, Support Vector Machine, Random Forests, PageRank (Graph), …
      • Modeling: Customer risk rating, Consumption Capacity, Graph model, …
      • Recommendation: Consumer behavior, Fraud detection, Risk analytics (Audit), …
  12. Data ingest
      Diagram: user, Run, program, Read, data
      • program: name, job id, params, config, inputs, outputs, start_ts, finish_ts, …
      • user: id, name, group, permission, …
      • data: name, size, location, department, permission, parent, children, …
      • Run/Read: ts_hour, ts_min, ts_sec, status, …
      Metadata to Graph: identify entities and relationships (Metadata Integration, Graph-based Traversal)
      • Entities → Vertices: User, Program, Data, …
      • Relationships → Edges: User run program, Program read data, …
      • Attributes → Properties: Name, …
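The metadata-to-graph mapping on slide 12 (entities become vertices, relationships become edges, attributes become properties) might look like the following plain-Python sketch. The record layout, field names, and the `metadata_to_graph` function are assumptions for illustration; the deck does not show its ingest code.

```python
def metadata_to_graph(run_records, read_records):
    """Turn flat audit records into vertices and edges.

    vertices: {(label, name): properties}
    edges:    [(out_vertex_key, edge_label, in_vertex_key, properties)]
    """
    vertices = {}
    edges = []
    for rec in run_records:  # relationship: user run program
        user = ("user", rec["user"])
        program = ("program", rec["program"])
        vertices.setdefault(user, {"name": rec["user"]})
        vertices.setdefault(program, {"name": rec["program"]})
        edges.append((user, "run", program, {"ts_hour": rec["ts_hour"]}))
    for rec in read_records:  # relationship: program read data
        program = ("program", rec["program"])
        data = ("data", rec["data"])
        vertices.setdefault(program, {"name": rec["program"]})
        vertices.setdefault(data, {"name": rec["data"],
                                   "department": rec["department"]})
        edges.append((program, "read", data, {"ts_hour": rec["ts_hour"]}))
    return vertices, edges


# Hypothetical metadata collected from a job scheduler and a file audit log.
run_records = [{"user": "alice", "program": "report_job", "ts_hour": 23}]
read_records = [{"program": "report_job", "data": "sales_q1",
                 "department": "sales", "ts_hour": 23}]
vertices, edges = metadata_to_graph(run_records, read_records)
```

Keying vertices by (label, name) merges records about the same entity from different source systems, which is the integration step the slide refers to.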
  13. Feature Selection
      Feature Selection: Who read the sensitive sales data in non-working time?
      Query:
          userFeaSele = graph.traversal().V().has("department","sales").inE("read").outV().hasLabel("program").inE("run").has("ts_hour",not(between(9,17))).outV()
      Advanced Feature Selection: Find the users who have access to a large amount of data.
      Query:
          … withComputer(SparkGraphComputer) …
          userAdvFeaSele = userFeaSele.pageRank().by('pageRank').order().by('pageRank').limit(30)
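The two queries on slide 13 can be mimicked without a graph database to show what they compute. The sketch below is plain Python over a list of edges, not Gremlin and not TinkerPop's PageRankVertexProgram: the first function walks data, read edge, program, run edge, user, keeping runs outside assumed working hours of 09:00 up to 17:00 (a range check, not a membership test on the two values 9 and 17), and the second is a textbook power-iteration PageRank. All names and sample data are hypothetical.

```python
WORK_START, WORK_END = 9, 17  # assumed working hours: 09:00 up to, not including, 17:00

# Edges as (source, label, target, properties); all names are made up.
edges = [
    ("alice", "run", "report_job", {"ts_hour": 23}),
    ("bob", "run", "report_job", {"ts_hour": 10}),
    ("carol", "run", "hr_job", {"ts_hour": 2}),
    ("report_job", "read", "sales_q1", {}),
    ("hr_job", "read", "payroll", {}),
]
departments = {"sales_q1": "sales", "payroll": "hr"}


def off_hours_readers(edges, departments, department):
    # Step 1: programs that read data belonging to the department.
    programs = {src for src, label, dst, _ in edges
                if label == "read" and departments.get(dst) == department}
    # Step 2: users whose "run" edge into those programs is outside working hours.
    return sorted({src for src, label, dst, props in edges
                   if label == "run" and dst in programs
                   and not (WORK_START <= props["ts_hour"] < WORK_END)})


def pagerank(edges, damping=0.85, iterations=50):
    nodes = sorted({n for src, _, dst, _ in edges for n in (src, dst)})
    out_links = {n: [dst for src, _, dst, _ in edges if src == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for n in nodes:
            for dst in out_links[n]:
                new_rank[dst] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


suspects = off_hours_readers(edges, departments, "sales")
ranks = pagerank(edges)
```

On a real cluster the second step is where SparkGraphComputer matters: PageRank touches every vertex and edge, so it runs as a distributed OLAP job rather than as an OLTP traversal.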
  14. Modeling
      • Model risk analysis with graphed metadata and information in the ERP.
      • Analyze each user against employee information from the ERP (years of working, age, role) to identify suspects. A non-sales person, for example an application R&D person, will be a suspect.
      • Audit recommendation.
      Diagram: the risk analysis model takes Graph: User List (userAdvFeaSele) from Advanced Feature Selection, ERP: Employee information, ERP: Violation information, and other systems, and produces the Audit Recommendation, a risk analysis report of suspects who stole sensitive data
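The modeling step on slide 14 combines the user list from the graph with ERP records. A deliberately simple rule-based sketch in plain Python follows; the deck does not describe its actual risk model, so the rules, field names, and the `audit_recommendation` function are assumptions chosen to match the slide's example of a non-sales person reading sales data.

```python
def audit_recommendation(candidate_users, employees, violations):
    """Flag candidates whose ERP record makes their sales-data access suspicious."""
    report = []
    for user in candidate_users:
        info = employees.get(user)
        if info is None:
            continue  # no ERP record; left for manual review in this sketch
        reasons = []
        if info["role"] != "sales":
            reasons.append("non-sales role: " + info["role"])
        if user in violations:
            reasons.append("prior violation on record")
        if reasons:
            report.append({"user": user, "reasons": reasons})
    return report


# Hypothetical inputs: the graph-derived user list plus two ERP extracts.
candidate_users = ["alice", "bob"]
employees = {
    "alice": {"role": "application R&D", "years_of_working": 2, "age": 29},
    "bob": {"role": "sales", "years_of_working": 8, "age": 41},
}
violations = {"alice"}
report = audit_recommendation(candidate_users, employees, violations)
```

A production model would presumably weigh these signals statistically rather than apply fixed rules; the sketch only shows how the graph output and the ERP data are joined.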
  15. Agenda
      • Challenges in hybrid data analytics
      • Enterprise data quality analytics system based on graphed metadata
      • Graph in enterprise data quality analytics solution
  16. Graph in enterprise data quality analytics solution
      • Data Source: User data, Machine data, log data, Behavioral data, third-party data
      • Ingest (load): Graphed metadata
      • Enterprise data quality system: Feature analysis, Lineage, Metadata management, Cleansing
      • Hadoop: HBase, Hive, HDFS, Spark, Titan, Solr, …
      • Business Application: Risk management, Data audit, Cost analytics, ……
  17. How to choose Enterprise Graph Database?
      Evaluate a graph database from the following perspectives:
      • Data storing features
      • Operation and manipulation features
      • Graph data structures
      • Query features
      • Schema and instance representation
      • Easy and centralized management
      • Expose service
      • Security features
      • Fast computing
  18. Titan
      • What is Titan
        − Distributed Graph Database
        − Based on TinkerPop (Gremlin)
        − Open Source
      • Titan Features
        − Distributed
        − Scalable: billions of edges and vertices
        − Real-time
        − Transactional database (concurrent users/ACID/..)
        − Global graph compute: graph data analytics, report, ETL
        − Search: geo, numeric range, and full text search
  19. Titan solution architecture
      Diagram components: application; Management API; TinkerPop API (Gremlin); Internal API layer; Database layer (Tx, Data, Mgmt, Optimizer); OLAP I/O Interface; Storage and Index Interface Layer; HBase Storage Backend; Solr External Index Backend; Spark; Big Data Platform; Gremlin GraphComputer; OLAP; OLTP; Hadoop
      • Optimized for storing and querying billions of vertices and edges over a cluster
      • Supports thousands of concurrent users
      • Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)
  20. Backend: HBase & Solr
      • HBase
        − Tight integration with the Hadoop ecosystem.
        − Native support for strong consistency.
        − Linear scalability with the addition of more machines.
        − Strictly consistent reads and writes.
        − Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
        − Support for exporting metrics via JMX.
        − Open source under the liberal Apache 2 license.
      • Solr
        − Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.
        − Solr is a standalone enterprise search server with a REST-like API.
        − Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more.
      Checklist (✓ = covered so far): ✓ Data storing features; ✓ Operation and manipulation features; ✓ Graph data structures; ✓ Query features; ✓ Schema and instance representation; Easy and centralized management; Expose service; Security features; Fast computing
  21. Integration and management: Titan in Ambari
      • Titan Deployment: Installation, Uninstallation, Titan client deployment, Titan server deployment
      • Titan server operation: Start server, Stop server, Service check
      • Titan Configuration: HBase backend, Solr backend, SparkGraphComputer, Titan server, Titan environment, Titan security
      • Titan security support: SSL, SASL, LDAP, Kerberos, Knox, HBase Access control
      Checklist (✓ = covered so far): ✓ Data storing features; ✓ Operation and manipulation features; ✓ Graph data structures; ✓ Query features; ✓ Schema and instance representation; ✓ Easy and centralized management; Expose service; Security features; Fast computing
  22. Remote Titan service
      Diagram components: Titan client (Gremlin Console, gremlin> prompt, local); Titan server (Gremlin Server, {RESTful}, {Web Socket}); Titan Engine (Mgmt API; TP API (Gremlin); Internal API layer; Database layer; OLAP I/O; Storage and Index Interface Layer); HBase; Solr; Spark; Gremlin GraphComputer
      Checklist (✓ = covered so far): ✓ Data storing features; ✓ Operation and manipulation features; ✓ Graph data structures; ✓ Query features; ✓ Schema and instance representation; ✓ Easy and centralized management; ✓ Expose service; Security features; Fast computing
  23. Titan security enhancement
      Diagram components: Remote Titan client; local Titan client; Titan server (Mgmt API; TP API (Gremlin); Internal API layer; Database layer; OLAP I/O Interface; Storage and Index Interface Layer); Cluster (HBase; Solr; Spark; Gremlin Graph Computer)
      Security description: SSL; Knox; SASL; LDAP/OS/Kerberized Titan user; HBase Access control; Kerberized Cluster
      Checklist (✓ = covered so far): ✓ Data storing features; ✓ Operation and manipulation features; ✓ Graph data structures; ✓ Query features; ✓ Schema and instance representation; ✓ Easy and centralized management; ✓ Expose service; ✓ Security features; Fast computing
  24. Integrate TinkerPop SparkGraphComputer with Titan DB
      Diagram components: Mgmt API; TP API (Gremlin); Internal API layer; Database layer; OLAP I/O Interface; Storage and Index Interface Layer; HBase; Solr; Gremlin GraphComputer; Hadoop-Gremlin; Spark-Gremlin; SparkGraphComputer; Graph RDD; Spark
      Vertex programs: PageRankVertexProgram, PeerPressureVertexProgram, BulkDumperVertexProgram, BulkLoaderVertexProgram, TraversalVertexProgram
      Checklist (✓ = covered so far): ✓ Data storing features; ✓ Operation and manipulation features; ✓ Graph data structures; ✓ Query features; ✓ Schema and instance representation; ✓ Easy and centralized management; ✓ Expose service; ✓ Security features; ✓ Fast computing
  25. Open source Graph Database
      JanusGraph: a new Linux Foundation project formed to continue development of the TitanDB graph database. The last Titan release, 1.0.0, was released on Sep 20, 2015.
  26. References & Contacts
      • Graph
        − Titan: http://titan.thinkaurelius.com
        − JanusGraph: http://janusgraph.org
        − TinkerPop: https://tinkerpop.apache.org
      • Jun (Terry) Yang, Team Leader, yangjuncn@cn.ibm.com, Linkedin.com/in/terryjunyang
      • Jing Chen (Jerry) He, Architect, jinghe@us.ibm.com, Linkedin.com/in/jing-chen-jerry-he-1553511
  27. Thanks! Questions?
