Intro to Big Data

Intro to Big Data
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.

Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.

  1. Introduction to Big Data. Zohar Elkayam, CTO, Brillix. Zohar@Brillix.co.il, Twitter: @realmgic
  2. Agenda • What is Big Data and the 3-Vs • Introduction to Hadoop • Who Handles Big Data and Data Science • NoSQL
  3. Who am I? • Zohar Elkayam, CTO at Brillix • Oracle ACE Associate • DBA, team leader, instructor and senior consultant for over 16 years • Editor (and manager) of ilDBA – Israel Database Community • Blogger – www.realdbamagic.com
  4. What is Big Data?
  5. (image slide)
  6. So, What is Big Data? • When the data is too big or moves too fast to handle in a sensible amount of time. • When the data doesn’t fit conventional database structure. • When the solution becomes part of the problem.
  7. Big Problems with Big Data • Unstructured • Unprocessed • Un-aggregated • Un-filtered • Repetitive • Low quality • And generally messy • Oh, and there is a lot of it
  8. (image slide)
  9. Sample of Big Data Use Cases Today • Media/Entertainment: viewers / advertising effectiveness • Communications: location-based advertising • Education & Research: experiment sensor analysis • Consumer Packaged Goods: sentiment analysis of what’s hot, problems • Health Care: patient sensors, monitoring, EHRs, quality of care • Life Sciences: clinical trials, genomics • High Technology / Industrial Mfg.: manufacturing quality, warranty analysis • Oil & Gas: drilling exploration sensor analysis • Financial Services: risk & portfolio analysis, new products • Automotive: auto sensors reporting location, problems • Retail: consumer sentiment, optimized marketing • Law Enforcement & Defense: threat analysis (social media monitoring, photo analysis) • Travel & Transportation: sensor analysis for optimal traffic flows, customer sentiment • Utilities: smart meter analysis for network capacity • On-line Services / Social Media: people & career matching, web-site optimization
  10. Most Requested Uses of Big Data • Log Analytics & Storage • Smart Grid / Smarter Utilities • RFID Tracking & Analytics • Fraud / Risk Management & Modeling • 360° View of the Customer • Warehouse Extension • Email / Call Center Transcript Analysis • Call Detail Record Analysis
  11. The Challenge
  12. The Big Data Challenge (3V)
  13. Big Data: Challenge to Value • Challenges today: high variety, high volume, high velocity • Business value tomorrow: deep analytics, high agility, massive scalability, real time
  14. Volume • Big data comes in one size: big. Size is measured in terabytes, petabytes and even exabytes and zettabytes. • The storing and handling of the data becomes an issue. • Producing value out of the data in a reasonable time is also an issue.
  15. Velocity • The speed in which the data is being generated and collected. • Streaming data and large volume data movement. • High velocity of data capture – requires rapid ingestion. • What happens on downtime (the backlog problem).
  16. Variety • Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio and videos. • Wide variety of rapidly evolving data types requires highly flexible stores and handling.
  17. Big Data is ANY data: unstructured, semi-structured and structured • Some has fixed structure • Some is “bring your own structure” • We want to find value in all of it
  18. Structured & Un-Structured • Un-structured: objects; flexible; structure unknown; textual and binary • Structured: tables; columns and rows; predefined structure; mostly textual
  19. Handling Big Data
  20. Big Data in Practice • Big data is big: technological infrastructure solutions needed. • Big data is messy: data sources must be cleaned before use. • Big data is complicated: need developers and system admins to manage intake of data.
  21. Big Data in Practice (cont.) • Data must be broken out of silos in order to be mined, analyzed and transformed into value. • The organization must learn how to communicate and interpret the results of analysis.
  22. Infrastructure Challenges • Infrastructure that is built for: • Large-scale • Distributed • Data-intensive jobs that spread the problem across clusters of server nodes
  23. Infrastructure Challenges – Cont. • Storage: • Efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data • With intelligent capabilities to reduce your data footprint such as: • Data compression • Automatic data tiering • Data deduplication
  24. Infrastructure Challenges – Cont. • Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing • Security capabilities that protect highly-distributed infrastructure and data
  25. Intro to Hadoop
  26. Apache Hadoop • Open source project run by Apache (2006). • Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. • Apache Hadoop has been the driving force behind the growth of the big data industry.
  27. Hadoop Creation History
  28. Key points • An open-source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers. • The complete technology stack includes • common utilities • a distributed file system • analytics and data storage platforms • an application layer that manages distributed processing, parallel computation, workflow, and configuration management • More cost-effective for handling large unstructured data sets than conventional approaches, and it offers massive scalability and speed
  29. Why use Hadoop? • Cost: leverages commodity HW & open source SW • Scalability: near-linear performance up to 1000s of nodes • Flexibility: versatility with data, analytics & operation
  30. Really, Why use Hadoop? • Need to process multi-petabyte datasets • Expensive to build reliability in each application • Nodes fail every day • Failure is expected, rather than exceptional • The number of nodes in a cluster is not constant • Need common infrastructure • Efficient, reliable, open source (Apache License) • The above goals are the same as Condor, but • Workloads are IO bound and not CPU bound
  31. Hadoop Benefits • Reliable solution based on unreliable hardware • Designed for large files • Load data first, structure later • Designed to maximize throughput of large scans • Designed to leverage parallelism • Designed to scale • Flexible development platform • Solution ecosystem
  32. Hadoop Limitations • Hadoop is scalable but not fast • Some assembly required • Batteries not included • Instrumentation not included either • DIY mindset (remember Linux/MySQL?) • On the larger scale – Hadoop is not cheap (but still cheaper than using old solutions)
  33. Example Comparison: RDBMS vs. Hadoop • Data size: gigabytes (RDBMS) vs. petabytes (Hadoop) • Access: interactive and batch vs. batch – NOT interactive • Updates: read/write many times vs. write once, read many times • Structure: static schema vs. dynamic schema • Scaling: nonlinear vs. linear • Query response time: can be near immediate vs. has latency (due to batch processing)
  34. Hadoop and Relational Database: best when used together • Relational database best used for: interactive OLAP analytics (<1 sec); multistep transactions; 100% SQL compliance • Hadoop best used for: structured or not (flexibility); scalability of storage/compute; complex data processing; cheaper compared to RDBMS
  35. Hadoop Components
  36. Hadoop Main Components • HDFS: Hadoop Distributed File System – distributed file system that runs in a clustered environment. • MapReduce – programming paradigm for running processes over clustered environments.
  37. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts
  38. HDFS Node Types HDFS has three types of nodes • Namenode (MasterNode) • Distributes files in the cluster • Responsible for the replication between the datanodes and for file block location • Datanodes • Responsible for the actual file store • Serving data from files to clients • BackupNode (version 0.23 and up) • A backup of the NameNode
  39. Typical implementation • Nodes are commodity PCs • 30-40 nodes per rack • Uplink from racks is 3-4 gigabit • Rack-internal is 1 gigabit
  40. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop
  41. MapReduce Example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc
• Runs programs (jobs) across many computers • Protects against single server failure by re-running failed steps • MR jobs can be written in Java, C, Python, Ruby, etc. • Users only write Map and Reduce functions • MAP – takes a large problem and divides it into sub-problems; performs the same function on all sub-problems • REDUCE – combines the output from all sub-problems
  42. Typical large-data problem • Map: iterate over a large number of records; extract something of interest from each • Shuffle and sort intermediate results • Reduce: aggregate intermediate results; generate final output (Dean and Ghemawat, OSDI 2004)
  43. MapReduce paradigm • Implement two functions: • Map(k1, v1) -> list(k2, v2) • Reduce(k2, list(v2)) -> list(v3) • Framework handles everything else* • Values with the same key go to the same reducer
  44. Divide and Conquer
  45. MapReduce – word count example
function map(String name, String document):
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  totalCount = 0
  for each count in partialCounts:
    totalCount += count
  emit(word, totalCount)
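A runnable Python version of the same word count, written for Hadoop Streaming (the file names mapper.py and reducer.py are illustrative, not from the deck). The mapper emits (word, 1) pairs; the framework sorts them by key before the reducer sees them, so counts for a word arrive adjacent:

# mapper.py - reads raw text from stdin, emits one (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py - input arrives sorted by key, so counts for a word are adjacent
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(current_word + "\t" + str(total))

The pair would be launched with the hadoop-streaming.jar invocation shown on slide 41, passing mapper.py as -mapper and reducer.py as -reducer.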
  46. MapReduce Word Count Process
  47. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
  48. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk • IO and latency cost of an iteration is high
  49. MapReduce is NOT good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records
  50. Improving Hadoop
  51. Improving Hadoop Core Hadoop is complicated, so tools were created to make things easier. Improving programmability: • Pig: programming language that simplifies Hadoop actions: loading, transforming and sorting data • Hive: enables Hadoop to operate as a data warehouse using SQL-like syntax
  52. Pig • Data flow processing • Uses the Pig Latin query language • Highly parallel in order to distribute data processing across many servers • Combines multiple data sources (files, HBase, Hive)
  53. Hive • Built on the MapReduce framework, so it generates MR jobs behind the scenes • Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS/HBase • Has partitioning and partition swapping • Good for random sampling • Example:
CREATE EXTERNAL TABLE vs_hdfs (
  site_id string, session_id string, time_stamp bigint, visitor_id bigint,
  row_unit string, evts string, biz string, plne string, dims string)
PARTITIONED BY (site string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE
LOCATION '/home/data/';

select session_id,
       get_json_object(concat(tttt, "}"), '$.BY'),
       get_json_object(concat(tttt, "}"), '$.TEXT')
from (
  select session_id, concat("{", regexp_replace(event, "[{|}]", ""), "}") tttt
  from (
    select session_id, get_json_object(plne, '$.PLine.evts[*]') pln
    from vs_hdfs_v1
    where site='6964264' and day='20120201' and plne!='{}' limit 10
  ) t LATERAL VIEW explode(split(pln, "},{")) adTable AS event
) t2
  54. Hadoop Technology Stack • HDFS (Yahoo): persistence • Map/Reduce (Google): parallel processing • Pig (Yahoo): scripting • Hive (Facebook): SQL query
  55. Improving Hadoop (cont.) For improving access: • HBase: column-oriented database that runs on HDFS. • Sqoop: a tool designed to import data from relational databases into Hadoop (HDFS or Hive).
  56. HBase What is HBase and why should you use it? • Huge volumes of randomly accessed data. • There are no restrictions on the number of columns per row – it’s dynamic. • Consider HBase when you’re loading data by key, searching data by key (or range), serving data by key, querying data by key, or when storing data by row that doesn’t conform well to a schema. HBase don’ts: • It doesn’t talk SQL, have an optimizer, or support transactions or joins. If you don’t use any of these in your database application then HBase could very well be the perfect fit. Example:
create 'blogposts', 'post', 'image'                        ---create table
put 'blogposts', 'id1', 'post:title', 'Hello World'        ---insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post' ---insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg'       ---insert value
get 'blogposts', 'id1'                                     ---select record
  57. Sqoop What is Sqoop? • It’s a command line tool for moving data between HDFS and relational database systems. • You can download drivers for Sqoop from Microsoft and then: • Import data/query results from SQL Server to Hadoop. • Export data from Hadoop to SQL Server. • It’s like BCP. • Example:
$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --hive-import
$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemData
  58. Improving Hadoop (cont.) • For improving coordination: Zookeeper • For improving scheduling/orchestration: Oozie • For improving UI: Hue • Machine learning: Mahout
  59. Hadoop Technology Ecosystem
  60. Hadoop Tools
  61. Hadoop cluster: a cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
  62. Hadoop In The Real World
  63. Who uses Hadoop?
  64. Big Data Market Survey • 3 major groups for rolling your own Big Data: • Integrated Hadoop providers. • Analytical database with Hadoop connectivity. • Hadoop-centered companies. • Big Data on the Cloud.
  65. Integrated Hadoop Providers – IBM InfoSphere • Database: DB2 • Deployment options: software (Enterprise Linux), cloud • Hadoop: bundled distribution (InfoSphere BigInsights); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene • NoSQL: HBase
  66. Integrated Hadoop Providers – Microsoft • Database: SQL Server • Deployment options: software (Windows Server), cloud (Windows Azure) • Hadoop: bundled distribution (Big Data Solution); Hive, Pig • NoSQL: none
  67. Integrated Hadoop Providers – Oracle • Database: none • Deployment options: appliance (Oracle Big Data Appliance) • Hadoop: bundled distribution (Cloudera’s Distribution including Apache Hadoop); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr • NoSQL: Oracle NoSQL Database
  68. Integrated Hadoop Providers – Pivotal Greenplum • Database: Greenplum Database • Deployment options: appliance (Modular Data Computing appliance), software (Enterprise Linux), cloud (Cloud Foundry) • Hadoop: bundled distribution (Pivotal HD); Hive, Pig, Zookeeper, HBase • NoSQL: HBase
  69. Hadoop Centered Companies • Cloudera – longest-established Hadoop distribution vendor. • Hortonworks – major contributor to the Hadoop code and core components. • MapR.
  70. Big Data and Cloud • Some Big Data solutions can be provided using IaaS: Infrastructure as a Service. • Private clouds can be constructed using Hadoop orchestration tools. • Public clouds provided by Rackspace or Amazon EC2 can be used to start a Hadoop cluster.
  71. Big Data and Cloud (cont.) • PaaS: Platform as a Service can be used to remove the need to configure or scale things. • The major PaaS providers are Amazon, Google and Microsoft.
  72. PaaS Services: Amazon • Elastic Map Reduce (EMR): MapReduce programs submitted to a cluster managed by Amazon. Good for EC2/S3 combinations. • DynamoDB: NoSQL database provided by Amazon to replace HBase.
  73. PaaS Services: Google • BigQuery: analytical database suitable for interactive analysis over datasets on the order of 1TB. • Prediction API: machine learning platform for classification and sentiment analysis, done with Google’s tools on customers’ data.
  74. PaaS Services: Microsoft • Windows Azure: a cloud computing platform and infrastructure that can be used as PaaS and as IaaS.
  75. Who Handles Big Data … and how?
  76. Big Data Readiness • The R&D Prototype Stage • Skills needed: • Distributed data deployment (e.g. Hadoop) • Python or Java programming with MapReduce • Statistical analysis (e.g. R) • Data integration • Ability to formulate business hypotheses • Ability to convey business value of Big Data
  77. Data Science • A discipline that combines math, statistics, programming and scientific instinct with the goal of extracting meaning from data. • Data scientists combine technical expertise, curiosity, storytelling and cleverness to find and deliver the signal in the noise.
  78. The Rise of the Data Scientist • Data scientists are responsible for • modeling complex business problems • discovering business insights • identifying opportunities • Demand is high for people who can help make sense of the massive streams of digital information pouring into organizations
  79. New Roles and Skills • Big Data Scientist: industry expertise, analytics skills • Big Data Engineers: Hadoop/Java, non-relational DB • Agility and focus on value
  80. Predictive Analytics • Predictive analytics looks into the future to provide insight into what will happen and includes what-if scenarios and risk assessment. It can be used for • forecasting • hypothesis testing • risk modeling • propensity modeling
  81. Prescriptive Analytics • Prescriptive analytics is focused on understanding what would happen based on different alternatives and scenarios, and then choosing the best options and optimizing what’s ahead. Use cases include • customer cross-channel optimization • best-action-related offers • portfolio and business optimization • risk management
  82. How Predictive Analytics Works • Traditional BI tools use a deductive approach to data, which assumes some understanding of existing patterns and relationships. • An analytics model approaches the data based on this knowledge. • For obvious reasons, deductive methods work well with structured data
  83. Inductive approach • An inductive approach makes no presumptions of patterns or relationships and is more about data discovery. Predictive analytics applies inductive reasoning to big data using sophisticated quantitative methods such as • machine learning • neural networks • robotics • computational mathematics • artificial intelligence • to explore all the data and discover interrelationships and patterns
  84. Inductive approach – Cont. • Inductive methods use algorithms to perform complex calculations specifically designed to run against highly varied or large volumes of data • The result of applying these techniques to a real-world business problem is a predictive model • The ability to know what algorithms and data to use to test and create the predictive model is part of the science and art of predictive analytics
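As a toy illustration of building such a predictive model (a hedged sketch: scikit-learn and the fabricated churn data are illustrative assumptions, not part of the deck), an inductive algorithm is fit to labeled history and then scored against an unseen case:

# Toy predictive model: learn from labeled history, score a new case.
# scikit-learn and the invented churn data are illustrative assumptions.
from sklearn.linear_model import LogisticRegression

# features: [monthly_spend, support_calls]; label: 1 = customer churned
X = [[20, 5], [90, 0], [15, 7], [80, 1], [25, 6], [95, 0]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)

new_customer = [[30, 4]]
print(model.predict(new_customer))        # predicted class: churn or not
print(model.predict_proba(new_customer))  # probabilities behind the prediction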
  85. Share Nothing vs. Share Everything • Share nothing: many processing engines; data is spread on many nodes; joins are problematic; very scalable • Share everything: many servers; data is located on a single storage; efficient joins; limited scalability
  86. Big Data and NoSQL
  87. The Challenge • We want scalable, durable, high volume, high velocity, distributed data storage that can handle non-structured data and that will fit our specific needs • RDBMS is too generic and doesn’t cut it any more – it can do the job but it is not cost effective for our usages
  88. The Solution: NoSQL • Let’s take some parts of the standard RDBMS out and design the solution for our specific uses • NoSQL databases have been around for ages under different names/solutions
  89. The NOSQL Movement • NOSQL is not a technology – it’s a concept. • We need high performance, scale-out abilities or an agile structure. • We are now willing to sacrifice our sacred cows: consistency, transactions. • Over 150 different brands and solutions (http://nosql-database.org/).
  90. NoSQL or NOSQL • NoSQL is not No to SQL • NoSQL is not Never SQL • NOSQL = Not Only SQL
  91. Why NoSQL? • Some applications need very few database features, but need high scale. • Desire to avoid data/schema pre-design altogether for simple applications. • Need for a low-latency, low-overhead API to access data. • Simplicity – no need for fancy indexing, just fast lookup by primary key.
  92. Why NoSQL? (cont.) • Developer friendly, DBAs not needed (?). • Schema-less. • Agile: non-structured (or semi-structured). • In memory. • No (or loose) transactions. • No joins.
  93. Is NoSQL an RDBMS Replacement? NO. Well... sometimes it is…
  94. RDBMS vs. NoSQL Rationale for choosing a persistent store: • Relational architecture: high-value, high-density, complex data; complex data relationships; schema-centric; designed to scale up & out; lots of general-purpose features/functionality; high overhead ($ per operation) • NoSQL architecture: low-value, low-density, simple data; very simple relationships; schema-free, unstructured or semi-structured data; distributed storage and processing; stripped-down, special-purpose data store; low overhead ($ per operation)
  95. Scalability and Consistency
  96. Scalability • NoSQL is sometimes very easy to scale out • Most have dynamic data partitioning and easy data distribution • But distributed systems always come with a price: the CAP theorem and its impact on ACID transactions
  97. ACID Transactions Most DBMS are built with ACID transactions in mind: • Atomicity: all or nothing; performs write operations as a single transaction • Consistency: any transaction will take the DB from one consistent state to another with no broken constraints; ensures replicas are identical on different nodes • Isolation: other operations cannot access data that has been modified during a transaction that has not yet completed • Durability: ability to recover the committed transaction updates against any kind of system failure (transaction log)
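A minimal sketch of atomicity in practice, using Python's built-in sqlite3 as an illustrative stand-in for any ACID-compliant DBMS (the accounts table and the simulated crash are invented for the example): either both writes commit together or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # transfer 50 from alice to bob as one transaction
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    raise RuntimeError("simulated crash between the two writes")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

print(list(conn.execute("SELECT * FROM accounts")))  # [('alice', 100), ('bob', 0)]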
  98. ACID Transactions (cont.) • ACID is usually implemented by a locking mechanism/manager • In a distributed system, central locking can become a bottleneck • Most NoSQL stores do not use (or limit) ACID transactions and replace them with something else…
  99. CAP Theorem • The CAP theorem states that in a distributed/partitioned application, you can only pick two of the following three characteristics: • Consistency • Availability • Partition Tolerance
  100. CAP in Practice
  101. NoSQL BASE • NoSQL usually provides BASE characteristics instead of ACID. BASE stands for: • Basically Available • Soft State • Eventual Consistency • It means that when an update is made in one place, the other partitions will see it over time – there might be an inconsistency window • Read and write operations complete more quickly, lowering latency
  102. Eventual Consistency
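A toy simulation of that inconsistency window in plain Python (the replicas and replication log are invented for illustration): a write is acknowledged by one replica immediately and reaches the other only when asynchronous replication catches up.

# Two replicas with asynchronous replication: reads from the secondary
# may briefly return stale (or missing) data, then converge.
primary, secondary = {}, {}
replication_log = []

def write(key, value):
    primary[key] = value                  # acknowledged right away
    replication_log.append((key, value))  # shipped to peers later

def replicate():
    while replication_log:
        key, value = replication_log.pop(0)
        secondary[key] = value

write("user:1", "alice")
print(secondary.get("user:1"))  # None: inside the inconsistency window
replicate()                     # asynchronous replication catches up
print(secondary.get("user:1"))  # 'alice': replicas have converged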
  103. Types of NoSQL
  104. NoSQL Taxonomy – types (examples on the following slides): • Key-Value Store • Document Store • Column Store • Graph Store
  105. NoSQL Map (diagram): typical RDBMS sits in the SQL comfort zone; key-value stores, column stores, document databases and graph databases trade off data size and complexity against performance
  106. Key Value Store • Distributed hash tables. • Very fast to get a single value. • Examples: • Amazon DynamoDB • Berkeley DB • Redis • Riak • Cassandra
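A quick taste of the key-value model through Redis and the redis-py client (the connection details and key names are illustrative; the other stores listed above expose a similar put/get interface):

import redis

r = redis.Redis(host="localhost", port=6379)

# the API is essentially a big distributed hash table
r.set("user:1001:name", "Zohar")
r.set("user:1001:role", "CTO")

print(r.get("user:1001:name"))  # b'Zohar' -- single-key reads are very fast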
  107. Document Store • Similar to key/value, but the value is a document. • JSON or something similar, flexible schema. • Agile technology. • Examples: • MongoDB • CouchDB • Couchbase
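A short document store sketch with MongoDB's pymongo client (database, collection and field names are invented for illustration); note that the two documents deliberately do not share a schema:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
posts = client.blog.posts

# flexible schema: documents in the same collection can differ
posts.insert_one({"title": "Hello World", "body": "This is a blog post"})
posts.insert_one({"title": "With image", "image": "image1.jpg", "tags": ["nosql"]})

print(posts.find_one({"title": "Hello World"}))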
  108. Column Store • One key, multiple attributes. • Hybrid row/column. • Examples: • Google BigTable • HBase • Amazon’s SimpleDB • Cassandra
  109. How Are Records Organized? • This is a logical table in RDBMS systems • Its physical organization is just like the logical one: column by column, row by row (diagram: rows 1-4 by columns 1-4)
  110. Query Data • When we query data, records are read in the order they are organized in the physical structure • Even when we query a single column, we still need to read the entire table and extract the column (diagram: Select Col2 From MyTable vs. Select * From MyTable)
  111. How Does a Column Store Save Data? (diagram: organization in a row store vs. organization in a column store)
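A compact way to see the row store vs. column store difference in plain Python (the data is invented for illustration): projecting a single column out of a row store touches every full row, while a column store keeps each column contiguous.

# Row store: every record carries all of its columns together.
row_store = [
    ("row1", "a1", "b1"),
    ("row2", "a2", "b2"),
    ("row3", "a3", "b3"),
]
# "Select Col2 From MyTable" must scan whole rows and extract one field:
col2_from_rows = [row[2] for row in row_store]

# Column store: each column lives contiguously under its own key.
column_store = {
    "col1": ["a1", "a2", "a3"],
    "col2": ["b1", "b2", "b3"],
}
# the same query just reads one column:
col2_from_columns = column_store["col2"]

print(col2_from_rows == col2_from_columns)  # True, with very different I/O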
  112. Graph Store • Inspired by graph theory. • Data model: nodes, relationships, properties on both. • Relational databases have a very hard time representing a graph. • Examples: • Neo4j • InfiniteGraph • RDF
  113. What is a Graph? • An abstract representation of a set of objects where some pairs are connected by links. • Object (Vertex, Node) – can have attributes like name and value • Link (Edge, Arc, Relationship) – can have attributes like type and name or date
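A bare-bones property graph in plain Python (structure and names are illustrative, loosely mirroring the alice/bob example on slide 116): both nodes and edges carry their own attribute dictionaries, which is what graph stores such as Neo4j persist natively.

# Minimal property graph: attributed nodes, attributed directed edges.
nodes = {
    1: {"type": "F", "name": "alice"},
    2: {"type": "M", "name": "bob"},
}
edges = [
    # (source, target, edge properties)
    (1, 2, {"type": "friend", "since": 2012}),
]

def neighbors(node_id):
    # follow outgoing edges from a node
    return [(dst, props) for src, dst, props in edges if src == node_id]

for dst, props in neighbors(1):
    print(nodes[1]["name"], "-[" + props["type"] + "]->", nodes[dst]["name"])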
  114. Graph Types (diagrams): undirected graph, directed graph, pseudo graph, multi graph
  115. More Graph Types (diagrams): weighted graph (edges carry weights, e.g. 10), labeled graph (edges carry labels, e.g. “Like”), property graph (nodes and edges carry properties, e.g. friend, date: 2013; Name: yosi, Age: 40; Name: ami, Age: 30)
  116. Relationships (diagram): nodes (ID:1, TYPE:F, NAME:alice), (ID:2, TYPE:M, NAME:bob), (ID:1, TYPE:G, NAME:NoSQL), (ID:1, TYPE:F, NAME:dafna); edge (TYPE:member, Since:2012)
  117. Conclusion • Big Data has been one of the hottest buzzwords of the last few years – we should all know what it’s about • DBAs are often called upon to solve big data problems – today DBAs need to know what to ask in order to provide good solutions, even when it’s not a database-related issue • NoSQL doesn’t have to be a Big Data solution, but Big Data often uses NoSQL solutions
  118. Thank You Zohar Elkayam, Brillix Zohar@Brillix.co.il www.realdbamagic.com
