Vital AI: Big Data Modeling



Video: https://www.youtube.com/watch?v=Rt2oHibJT4k

Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem, but the "Variety" problem is largely unaddressed: a lot of manual "data wrangling" is still needed to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, but the rate of change in the data definitions is increasing as well. We can’t keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.

This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. This talk covers:

Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources



Vital AI: Big Data Modeling Presentation Transcript

  • 1. Big Data Modeling Today: Marc C. Hadfield, Founder
 Vital AI
 http://vital.ai marc@vital.ai 917.463.4776
  • 2. intro Marc C. Hadfield, Founder Vital AI
  • 3. Big Data Modeling is
 Data Modeling with the “Variety” Big Data Dimension in mind…
  • 4. Big Data “Variety” Dimension The “Variety” problem can be addressed by a combination of improved tools and a methodology involving both system architecture and
 data science / analysis. Compared to Volume and Velocity, Variety is a very labor-intensive, human-centric process. Variety is the many types of data to be utilized together in a data-driven application.
 Potentially too many types for any single person to keep track of (especially in Life Sciences).
  • 5. Key Takeaways: Using OWL as a “meta-schema” can drastically reduce operations/development effort and increase the value of the data for analysis. OWL can augment and not replace familiar development processes and tools. A huge amount of ongoing development effort is spent transforming data across components and keeping data consistent during analysis. Collecting Good Data = Good Analytics
  • 6. Big Data Modeling: Challenges Goals OWL as Modeling Language Using OWL-based Models… Collaboration/Modeling Tools
  • 7. Examples from NYC Department of Education: Domain Ontology Application Architecture Development Methodology/Tools
  • 8. NYC Department of Education:
  • 9. Data Architecture Data Science Data Models in:
  • 10. Challenges
  • 11. Mobile/Web App Architecture Data Model Data Model Mobile App Server Implementation Database
  • 12. Database Master Database "Data Lake" Database Database Database Business Intelligence Data Analytics Dashboard Enterprise Data Warehouse Architecture Schema “on read” or “on write” Data Model Data Model Data Model Data Model Data Model ETL Process
  • 13. Mobile App Server Layer Real Time Data Calculated Views Hadoop Predictive Analytics Master Database "Data Lake" Business Intelligence Data Analytics Dashboard Lambda Architecture + Hadoop: Data Driven App Data Model Data Model Data Model Data Model Data Model
  • 14. Data Wrangling / Data Science Master Database "Data Lake" Business Intelligence Data Analytics Raw Data R Data Model Data Model Data Model Prediction Models must integrate back with production environment:
  • 15. Same Data, Different Contexts… Redundant Models.
  • 16. Data Architecture Issues
 Redundant Data Definitions (Database Schema, JSON Data, Data Object Classes, Avro/Parquet): Considerable Development / Maintenance / Operational Overhead
  • 17. Data Science / Data Wrangling Issues 
 Data Harmonization: Merging Datasets from Multiple Sources
 Loss of Context: Feature f123 = Column135 X Column45 / Column13
 Side note: Let’s stop using CSV files for datasets!
 No more flat datasets!
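
To make the "loss of context" point concrete, here is a minimal sketch contrasting a positional feature such as f123 = Column135 X Column45 / Column13 with the same calculation over named properties. The class name, the property names, and the sample values are hypothetical and chosen only for illustration; they do not come from the talk.

```python
from dataclasses import dataclass


# Flat, context-free form: nothing in the code says what the columns mean.
def f123(row):
    return row[135] * row[45] / row[13]


# Contextual form. The class and property names are hypothetical.
@dataclass
class SchoolYearRecord:
    enrolled_students: int
    attendance_rate: float
    teacher_count: int

    def attended_students_per_teacher(self) -> float:
        # Same arithmetic as f123, but each operand carries its meaning.
        return self.enrolled_students * self.attendance_rate / self.teacher_count


# Illustrative values only.
row = [0.0] * 136
row[135], row[45], row[13] = 400, 0.9, 20
record = SchoolYearRecord(enrolled_students=400, attendance_rate=0.9, teacher_count=20)
```

Both forms compute the same number; only the second one remains readable once the dataset has been merged, reordered, or handed to another team.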
  • 18. Goals
  • 19. Goals: Reduce redundancy in Data Definitions Enforce Clean/Harmonized Data
 Use Contextual Datasets Use Best Software Components (Databases, Analytics, …) Use Familiar Tools (IDE, git, Languages, R)
  • 20. OWL as Modeling Language
  • 21. Web Ontology Language (OWL) Specifies an Ontology (“Data Model”) Formal Semantics, W3C Standard Provides a language to describe the meaning of data properties and how they relate to classes. Example: Mammal
 Necessary Conditions: warm-blooded, vertebrate animal, has hair or fur, secretes milk, (typically) live birth Greater descriptive power than Schema (SQL Tables) and Serialization Frameworks (Avro)
  • 22. Why OWL? If we can more formally specify what the data *means*, then we can have a single data model (ontology) apply to our entire architecture, and data can be transformed automatically locally as per the needs of a specific software module. Manually coded data transforms may be “lossy” and/or introduce errors, so eliminating them helps keep data clean.
  • 23. Why OWL? (continued) Example: if we specify what a “Document” is, then a text-mining analyzer will know how to access the textual data without further prompting. Example: if we specify Features for use in Machine Learning in the ontology, then features can be generated automatically to train Machine Learning Models, and the same features would be generated when we use the model in production.
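
The second example on this slide can be sketched as follows: feature definitions are declared once, next to the data model, and both the training path and the production path call the same generator. This is only an illustration in plain Python. The feature names and the dictionary-based objects are assumptions; in the approach described in the talk the definitions would live in the ontology itself.

```python
# Hypothetical feature definitions that would live alongside the data model.
FEATURES = {
    "name_length": lambda obj: len(obj["name"]),
    "has_programs": lambda obj: 1 if obj["programs"] else 0,
}


def feature_vector(obj):
    """Generate features in a fixed (sorted-by-name) order from the shared definitions."""
    return [FEATURES[name](obj) for name in sorted(FEATURES)]


# Training and production use the same function, so the features cannot drift apart.
training_row = feature_vector({"name": "PS 123", "programs": ["STEM"]})
production_row = feature_vector({"name": "PS 9", "programs": []})
```

Because the order and the definitions come from one place, a model trained on `training_row` sees identically constructed input at prediction time.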
  • 24. Why OWL? (continued) Note: As ontologies can extend other ontologies, rather than a single ontology, a collection of linked ontologies can be used, allowing segmentation across an organization.
  • 25. Vital Core Ontology Protege Editor… Nodes, Edges, HyperNodes, HyperEdges get URIs John/WorksFor/IBM —> Node / Edge / Node
  • 26. Vital Core Ontology Vital Domain Ontology Application Domain Ontology Extending the Ontology
  • 27. NYC Dept of Education Domain Ontology
  • 28. Generating Data Bindings with VitalSigns: Ontology VitalSigns Groovy Bindings Semantic Bindings Hadoop Bindings Prolog Bindings Graph Bindings HBase Bindings JavaScript Bindings Code/Schema Generation vitalsigns generate -ont name…
  • 29. Data Representations
 Groovy:
   person123.name = "John"
   person123.worksFor.company456
 RDF:
   <person123> <hasName> "John"
   <worksFor123> <hasSource> <person123>
   <worksFor123> <hasDestination> <company456>
   <worksFor123> <hasType> <worksFor>
 HBase:
   person123, Node:type=Person, Node:hasName="John"
   worksFor123, Edge:type=worksFor, Edge:hasSource=person123, Edge:hasDestination=company456
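
As a rough illustration of how one node and one edge can be rendered into the RDF-style and HBase-style forms on this slide, the sketch below uses plain Python dictionaries. The functions `to_triples` and `to_rows` are hypothetical helpers written for this example; they are not part of VitalSigns, whose generated bindings are not shown in the slides.

```python
# One in-memory graph fragment: a node and an edge, as plain dicts.
node = {"id": "person123", "type": "Person", "hasName": "John"}
edge = {"id": "worksFor123", "type": "worksFor",
        "hasSource": "person123", "hasDestination": "company456"}


def to_triples(node, edge):
    """Render the fragment as (subject, predicate, object) triples."""
    return [
        (node["id"], "hasName", node["hasName"]),
        (edge["id"], "hasSource", edge["hasSource"]),
        (edge["id"], "hasDestination", edge["hasDestination"]),
        (edge["id"], "hasType", edge["type"]),
    ]


def to_rows(node, edge):
    """Render the fragment as row key -> {"family:qualifier": value} mappings."""
    rows = {}
    for family, obj in (("Node", node), ("Edge", edge)):
        rows[obj["id"]] = {f"{family}:{k}": v for k, v in obj.items() if k != "id"}
    return rows


triples = to_triples(node, edge)
rows = to_rows(node, edge)
```

The point of the sketch is that both outputs are derived mechanically from the same object, so no hand-written transform is needed per datastore.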
  • 30. VitalSigns Generation —> JAR Library Runtime Domain Ontology Domain Ontology Domain Ontology Domain Ontology VitalSigns Class
  • 31. Using OWL-based Models
  • 32. Developing with the Ontology in UI, Hadoop, NLP, Scripts, ... Node:Person Node:Person Edge:hasFriend Set<Friend> person123.getFriends() Eclipse IDE
  • 33. JVM Development
 // Reference to an NYCSchool object
 NYCSchool school123 = … // get from database
 // Get a list of programs, local context (cache)
 List<NYCSchoolProgram> localPrograms = school123.getPrograms()
 // Get a list of programs, global context (database)
 List<NYCSchoolProgram> servicePrograms = school123.getPrograms(Context.ServiceWide)
  • 34. Using JSON-Schema Data in JavaScript
 for (var i = 0; i < progressReports.length; i++) {
   var r = progressReports[i];
   var sub = $('<ul>');
   sub.append('<li>Overall Grade: ' + r.progReportOverallGrade + '</li>');
   sub.append('<li>Progress Grade: ' + r.progReportProgressGrade + '</li>');
   sub.append('<li>Environment Grade: ' + r.progReportEnvironmentGrade + '</li>');
   sub.append('<li>College and Career Readiness Grade: ' + r.progRepCollegeAndCareerReadinessGrade + '</li>');
   sub.append('<li>Performance Grade: ' + r.progReportPerformanceGrade + '</li>');
   sub.append('<li>Closing the Achievement Gap Points: ' + r.progReportClosingTheAchievementGapPoints + '</li>');
   sub.append('<li>Percentile Rank: ' + r.progReportPercentileRank + '</li>');
   sub.append('<li>Overall Score: ' + r.progReportOverallScore + '</li>');
 }
  • 35. NoSQL Queries Query API / CRUD Operations
 Queries generated into “native” NoSQL Query format: SPARQL / Triplestore (AllegroGraph), HBase / DynamoDB, MongoDB, Hive/HiveQL (on Spark/Hadoop 2.0) Query Types: “Select” and “Graph” Abstract type of datastore from application/analytics code Pass in a “native” query when necessary
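
The idea of one abstract query being generated into several "native" formats might look roughly like the sketch below. The `SelectQuery` class and both translator functions are hypothetical and are not the Vital query API; the SPARQL text and the MongoDB filter document are simply two familiar target shapes.

```python
from dataclasses import dataclass


@dataclass
class SelectQuery:
    """Hypothetical datastore-neutral query: match objects of a type by one property."""
    type_name: str
    prop: str
    value: str


def to_sparql(q):
    # Illustration only: no escaping of the value, and IRIs are left relative.
    return (f'SELECT ?s WHERE {{ ?s a <{q.type_name}> ; '
            f'<{q.prop}> "{q.value}" . }}')


def to_mongo_filter(q):
    # A MongoDB-style filter document; assumes each stored document has a "type" field.
    return {"type": q.type_name, q.prop: q.value}


query = SelectQuery(type_name="Person", prop="hasName", value="John")
sparql_text = to_sparql(query)
mongo_filter = to_mongo_filter(query)
```

Application code would build only `SelectQuery` objects, which is what lets the datastore be swapped without touching analytics code.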
  • 36. Data Serialization, Analytics Jobs Data Serialized into file format by blocks of objects Leverage Hadoop Serialization Standards: Sequence File, Avro, Parquet Get data in and out of HDFS Files Spark/Hadoop jobs passed a set of objects as input
 URI of object is key Data Objects are serialized into Compressed Strings for transport over Flume, etc.
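
A minimal sketch of "URI of object is key" plus "serialized into Compressed Strings", assuming JSON as the intermediate form and zlib with Base64 for compression and text-safe transport. The slides do not say which encoding Vital actually uses, so these choices, and the example URI, are assumptions.

```python
import base64
import json
import zlib


def serialize(obj):
    """Return (key, payload): the object's URI and a compressed, text-safe string."""
    raw = json.dumps(obj, sort_keys=True).encode("utf-8")
    payload = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return obj["URI"], payload


def deserialize(payload):
    """Invert serialize(): decode Base64, decompress, parse JSON."""
    return json.loads(zlib.decompress(base64.b64decode(payload)).decode("utf-8"))


# Hypothetical object; the URI and properties are illustrative.
person = {"URI": "http://example.org/person123", "type": "Person", "hasName": "John"}
key, payload = serialize(person)
restored = deserialize(payload)
```

Keeping the payload as an ASCII string means it can travel through line-oriented transports without further escaping.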
  • 37. Machine Learning Via Hadoop, Spark, R Mahout, MLlib Build Predictive Models Classification, Clustering... Use Features defined in Ontology Learn Target defined in Ontology Models consume Ontology Data as input
  • 38. Natural Language Processing/Text Mining Topic Categorization… Extract Entities… Text Features from Ontology Classes extending Document…
  • 39. Graph Analytics GraphX, Giraph: PageRank, Centrality, Interest Graph, …
  • 40. Inference / Rules Use Semantic Web Rule Engines / Reasoners
 Load Ontology + RDF Representation of Data Instances (Individuals)
  • 41. R Analytics Load Serialized Data into R Dataframes. Reference Classes and Properties by Name in Dataframes (cleaner code than huge number of columns)
  • 42. Graph Visualization with Cytoscape Data already in Node/Edge Graph Form
  • 43. Graph Visualization with Cytoscape
  • 44. Visualize Data “Hot Spots”
  • 45. NYC Schools Architecture Mobile App JSON Schema VertX Vital Flow Queue Rule Engine NLP DynamoDB Vital Prime VitalService Client NYC Schools Data Model R Serialized Data Data Insights
  • 46. Collaboration & Tools
  • 47. Collaboration/Tools git - code revision system OWL Ontologies treated as code artifact Coordinate across Teams:
 “Front End”, “Back End”, “Operations”, “Business Intelligence”, “Data Science”… Coordinate across Enterprise: Departments / Business Units “Data Model of Record”
  • 48. Ontology Versioning NYCSchoolRecommendation-0.1.8.owl Semantic Versioning (http://semver.org/)
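
The file naming convention on this slide (name, then a MAJOR.MINOR.PATCH version, then `.owl`) can be handled with a few lines of code. The sketch below is a hypothetical helper, not the `vitalsigns` implementation; it only parses the name and produces the next patch version.

```python
import re

# name-MAJOR.MINOR.PATCH.owl, as in NYCSchoolRecommendation-0.1.8.owl
PATTERN = re.compile(
    r"^(?P<name>.+)-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\.owl$"
)


def parse(filename):
    """Split an ontology file name into its base name and version tuple."""
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"not a versioned ontology file name: {filename!r}")
    return m["name"], (int(m["major"]), int(m["minor"]), int(m["patch"]))


def bump_patch(filename):
    """Return the file name for the next patch version."""
    name, (major, minor, patch) = parse(filename)
    return f"{name}-{major}.{minor}.{patch + 1}.owl"


next_name = bump_patch("NYCSchoolRecommendation-0.1.8.owl")
```

Comparing the parsed tuples, rather than the raw strings, gives the correct ordering once any component reaches two digits.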
  • 49. vitalsigns command line
 vitalsigns generate: code/schema generation
 vitalsigns upversion/downversion: increase version patch number, move previous version to archive, rename OWL file including username
 JAR files pushed into Maven (Continuous Integration)
  • 50. Git Integration git: add, commit, push, pull diff: determine differences merge: merge two Ontologies detect types of Ontology changes merge into new patch version
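
To illustrate "detect types of Ontology changes", the sketch below compares two versions of a model and labels each difference. It assumes a deliberately simplified representation, a mapping from class name to a set of property names, and the change labels are invented for the example; a real OWL diff would also have to consider property ranges, annotations, and class hierarchy.

```python
def diff_models(old, new):
    """Classify differences between two {class_name: set_of_properties} models."""
    changes = []
    for cls in sorted(new.keys() - old.keys()):
        changes.append(("class_added", cls))
    for cls in sorted(old.keys() - new.keys()):
        changes.append(("class_removed", cls))
    for cls in sorted(old.keys() & new.keys()):
        for prop in sorted(new[cls] - old[cls]):
            changes.append(("property_added", f"{cls}.{prop}"))
        for prop in sorted(old[cls] - new[cls]):
            changes.append(("property_removed", f"{cls}.{prop}"))
    return changes


# Hypothetical before/after versions of a small model.
old_model = {"School": {"name"}, "Program": {"title"}}
new_model = {"School": {"name", "grade"}, "Teacher": {"name"}}
changes = diff_models(old_model, new_model)
```

A merge step could then accept purely additive changes automatically and flag removals for review before producing the new patch version.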
  • 51. OWL as Data Modeling Language:
 Data Architecture & Data Science / Analytics Conclusions Leverage Existing Tools, Components Reduce model redundancy, reduce effort. A Means to Collaborate Across Teams: Data Model of Record Cleaner Data Integrate additional analysis
  • 52. For more information, please contact: Marc C. Hadfield http://vital.ai marc@vital.ai 917.463.4776 Thank You!