Big Data with Not Only SQL
 


Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors...

© All Rights Reserved

    Big Data with Not Only SQL: Presentation Transcript

    • Open for Business…
    • WHO AM I
      • Big Data / Analytics / BI & Cloud Solutions Specialist
      • http://www.linkedin.com/in/JulioPhilippe
      • Skills: Architecture, Business Intelligence, IT Transformation, Cloud Computing, IT Solutions, Management, Mentoring, Big Data, Analytics, Business Development, Hadoop, Datacenter Optimization, Data Warehousing
    • BIG DATA MANAGEMENT INSIGHT
      • « Data are not born relevant; they become so! »
    • DATA-DRIVEN ONLINE WEBSITES
      • To run the apps: messages, posts, blog entries, video clips, maps, web graph...
      • To give the data context: friend networks, social networks, collaborative filtering...
      • To keep the applications running: web logs, system logs, system metrics, database query logs...
    • BIG DATA – NOT ONLY DATA VOLUME
      • Improve analytics and statistical models
      • Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors...
      • Build efficient, massively parallel, highly scalable and highly available architectures that handle very large data volumes, up to several petabytes
      • Themes: Web Technologies, Database Scale-out, Relational Data Analytics, Distributed Data Analytics, Distributed File Systems, Real-Time Analytics
    • BIG DATA APPLICATION DOMAINS
      • Digital marketing optimization (e.g., web analytics, attribution, golden path analysis)
      • Data exploration and discovery (e.g., identifying new data-driven products, new markets)
      • Fraud detection and prevention (e.g., revenue protection, site integrity & uptime)
      • Social network and relationship analysis (e.g., influencer marketing, outsourcing, attrition prediction)
      • Machine-generated data analytics (e.g., remote device insight, remote sensing, location-based intelligence)
      • Data retention (e.g., long-term conservation, data archiving)
    • SOME BIG DATA USE CASES BY INDUSTRY
      • Energy: smart meter analytics; distribution load forecasting & scheduling; condition-based maintenance
      • Telecommunications: network performance; new products & services creation; Call Detail Records (CDRs) analysis; customer relationship management
      • Retail: dynamic price optimization; localized assortment; supply-chain management; customer relationship management
      • Manufacturing: supply chain management; customer care call centers; preventive maintenance and repairs; customer relationship management
      • Banking: fraud detection; trade surveillance; compliance and regulatory; customer relationship management
      • Insurance: catastrophe modeling; claims fraud; reputation management; customer relationship management
      • Public: fraud detection; fighting criminality; threats detection; cyber security
      • Media: large-scale clickstream analytics; abuse and click-fraud prevention; social graph analysis and profile segmentation; campaign management and loyalty programs
      • Healthcare: clinical trials data analysis; patient care quality and program analysis; supply chain management; drug discovery and development analysis
    • TOP 10 BIG DATA SOURCES
      1. Social network profiles
      2. Social influencers
      3. Activity-generated data
      4. SaaS & Cloud Apps
      5. Public web information
      6. MapReduce results
      7. Data warehouse appliances
      8. Columnar/NoSQL databases
      9. Network and in-stream monitoring technologies
      10. Legacy documents
    • NEW DATA AND MANAGEMENT ECONOMICS
      • Compute trends: new analytics (massively parallel processing, algorithms…)
      • Storage trends: new data structures (distributed file systems, NoSQL databases, NewSQL…)
      • Diagram labels: logical data warehouse; enterprise data warehouse; general-purpose data warehouse; proprietary and dedicated data warehouse; "OLTP is the data warehouse"; object storage; multi-structured data; distributed file systems; master/slave, master/master, federated/sharded; master data management, data quality, data integration
    • MOVING COMPUTATION TO STORAGE
      • General-purpose storage servers
        • Combine servers with disks and networking to reduce latency
        • Specialized software enables general-purpose system designs to provide high-performance data services
      • Diagram: moving data processing to storage across three generations (Legacy, Emerging, Next Gen.), showing where the application, data processing and metadata management layers sit relative to the network and the storage (storage array on SAN/NAS versus storage servers)
    • BIG DATA ARCHITECTURE
      • BI & DWH architecture (conventional)
        • SQL based
        • High availability
        • Enterprise database
        • Right design for structured data
        • Current storage hardware (SAN, NAS, DAS)
        • Diagram: app servers, network switches, database servers, SAN switch, storage array
      • Analytics architecture (next generation)
        • Not only SQL based
        • High scalability, availability and flexibility
        • Compute and storage in the same box to reduce network latency
        • Right design for semi-structured and unstructured data
        • Diagram: edge nodes, network switches, data nodes
    • DATA WAREHOUSE
      • Data warehouse appliances: EMC Greenplum, Microsoft Parallel Data Warehouse, IBM Netezza, Oracle Exadata, SAP HANA, ParAccel Analytic Database, Teradata, HP Vertica
      • Characteristics: SQL database, massively parallel processing, Hadoop connectivity, column-oriented database, in-memory database
    • MAPREDUCE ALGORITHMS
      • MapReduce
        • MapReduce is the programming paradigm popularized by Google researchers
        • Hadoop is an open-source implementation of MapReduce (Yahoo)
        • Open-source software framework for distributed computation
        • Parallel computation (Map) on each block (split) of data in a DFS file, which outputs a stream of (key, value) pairs to the local file system
        • JobTracker schedules and manages jobs
        • TaskTracker executes individual map() and reduce() tasks on each cluster node
      • Algorithms
        • Association rule learning algorithms
        • Genetic algorithms
        • Neural network algorithms
        • Statistical algorithms (pandas)
        • Machine learning algorithms (Mahout, Weka, scikit-learn)
        • Natural language processing algorithms
        • Trading algorithms
        • Clinical design algorithms
        • Searching algorithms (Lucene, Solr, Katta, Elasticsearch, OpenSearchServer…)
      • Languages: PHP, Erlang, Python, Ruby, R, Java
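To make the Map, copy/sort and Reduce stages concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and each string in the input list stands in for one split of a DFS file.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Copy/sort: group all emitted values by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into one result."""
    return key, sum(values)

def word_count(splits):
    # In a real cluster each split would be mapped on a different node.
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big", "data not only sql"])
```

In a real deployment the map calls run in parallel next to the data blocks, and the framework performs the shuffle across the network; only the per-record logic resembles this sketch.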
    • DISTRIBUTED FILE SYSTEMS
      • Systems that permanently store data
      • Data is divided into logical units (files, shards, chunks, blocks…)
      • A file path joins file and directory names into a relative or absolute address to identify a file
      • Support access to files on remote servers
      • Support concurrency
      • Support distribution
      • Support replication
      • Examples: NFS, GPFS, Hadoop HDFS, GlusterFS, MogileFS, MooseFS…
      • Diagram: an app talking to one master and several slaves
    • NOSQL DATABASE CATEGORIES
      • NoSQL = Not only SQL
        • Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets
        • Basically available, scalable, eventually consistent
        • Easy to use
        • Tolerant of scale by way of horizontal distribution
      • Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable…
      • Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…
      • Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
      • Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…
    • NOSQL DATABASE CATEGORIES
      • Key-Value
        • Store items under an alphanumeric identifier (key)
        • Associate values in simple standalone tables
        • Values must be a string, list or set
        • Data search is based on the key
        • Fast and highly scalable for retrieving a value
        • Domains: managing user profiles, retrieving product names…
        • Example (Key, Value): User001, Peter; User002, Paul; User003
      • Column
        • BigTable-style database
        • Column-oriented data structure that accommodates multiple attributes per key
        • Petabyte scale
        • Domains: distributed data storage, versioning with timestamp, sorting, parsing, data exploration
        • Example (Key, Timestamp, Type, Size): 12, Zebra, Medium; 11, Lion, Big; 13, Bird, Small
      • Document
        • Documents (objects) map nicely to programming language data types
        • Value = Collection>Document>Field
        • Embedded documents and arrays reduce the need for joins
        • Dynamically typed for easy schema evolution
        • No joins and no multi-document transactions, for high performance and easy scalability
        • Example (Document, Name, Age): Doc001, Paul, 30; Doc002, Jacques, 35
      • Graph
        • Structured relational graphs of interconnected key-value pairings
        • Object-oriented network of nodes (Node), node relationships (Edge) and properties (node attributes expressed as key-value pairs)
        • Relations between data
        • Domains: social networks, recommendations, investigations, relationships…
        • Example: nodes X and Y with Name and Age properties (values shown: John, 30; Rick; Bob, 50), joined by edges E1 and E2 (a, b: X, Y and Y, X)
      • NoSQL data modeling techniques: geo hashing, index table, composite keys aggregation, materialized paths…
        http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
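The four data models can be contrasted with plain Python structures. This is only a sketch of the shapes involved, not the API of any product; the row key `row1`, the mapping of names to nodes X and Y, and the helper names `latest` and `neighbors` are assumptions made for illustration.

```python
# Key-value: an opaque value looked up by its key, nothing else.
key_value = {"User001": "Peter", "User002": "Paul"}

# Column (BigTable style): row key -> column -> {timestamp: value}.
# "row1" is a hypothetical row key; the slide does not show one.
column = {
    "row1": {
        "Type": {12: "Zebra", 11: "Lion"},
        "Size": {12: "Medium", 11: "Big"},
    }
}

# Document: collection > document > field, with no fixed schema.
collection = {
    "Doc001": {"Name": "Paul", "Age": 30},
    "Doc002": {"Name": "Jacques", "Age": 35},
}

# Graph: nodes carry key-value properties, edges relate nodes.
# Which name belongs to which node is an assumption.
nodes = {"X": {"Name": "John", "Age": 30}, "Y": {"Name": "Rick"}}
edges = [("X", "Y"), ("Y", "X")]

def latest(versions):
    """Return the value with the highest timestamp in a versioned cell."""
    return versions[max(versions)]

def neighbors(node):
    """Follow outgoing edges from one node."""
    return [b for a, b in edges if a == node]
```

The access patterns differ accordingly: a key lookup for key-value, a versioned cell read for column stores, field access inside a document, and edge traversal for graphs.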
    • NEW SQL
      • Relational database with horizontal scalability
      • MySQL ecosystem
      • Distributed database with MySQL compliance: Cubrid
      • Analytic database: InfiniDB
      • In-memory database with MySQL compliance: VoltDB
    • BIG DATA ARCHITECTURE OVERVIEW
      • Users: administrators, engineers, analysts, business users, data scientists, mobile clients
      • Functions: development, data management, data modeling, BI / analytics, activity reporting, data quality, master data management, mobile apps, data analysis & visualization
      • NoSQL side: unstructured and structured data warehouse, MPP, NoSQL engine, distributed file systems, share-nothing architecture, algorithms
      • SQL side: structured data warehouse and OLAP cubes, MPP, in-memory, column database, SQL engine, share-nothing architecture
      • Data transfer and data integration
      • Data sources: files, web data, RDBMS
    • HDFS & MAPREDUCE
      • Hadoop Distributed File System
        • A scalable, fault-tolerant, high-performance distributed file system
        • Asynchronous replication
        • Write-once and read-many (WORM)
        • Hadoop cluster with 3 DataNodes minimum
        • Data divided into blocks, each block replicated 3 times (default)
        • No RAID required for DataNodes
        • Interfaces: Java, Thrift, C library, FUSE, WebDAV, HTTP, FTP
        • NameNode holds filesystem metadata
        • Files are broken up and spread over the DataNodes
      • Hadoop MapReduce
        • Software framework for distributed computation
        • Input | Map() | Copy/Sort | Reduce() | Output
        • JobTracker schedules and manages jobs
        • TaskTracker executes individual map() and reduce() tasks on each cluster node
      • Diagram: clients, master node, worker nodes
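The block and replication figures above translate into simple arithmetic. The sketch below assumes a 64 MB block size (the slide does not state one) and uses a naive round-robin placement, which is a stand-in for illustration and not the placement policy HDFS actually applies; `hdfs_footprint` and `place_blocks` are hypothetical names.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Estimate block count and raw storage for one file.

    block_size_mb=64 is an assumption; replication=3 is the default
    quoted on the slide.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "replicas": blocks * replication,
        "raw_mb": file_size_mb * replication,
    }

def place_blocks(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n = len(datanodes)
    return {
        block: [datanodes[(block + r) % n] for r in range(replication)]
        for block in range(num_blocks)
    }

footprint = hdfs_footprint(1000)
placement = place_blocks(footprint["blocks"], ["dn1", "dn2", "dn3", "dn4"])
```

The `ValueError` branch mirrors the "3 DataNodes minimum" bullet: with three replicas per block, fewer than three nodes cannot hold distinct copies.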
    • HBASE
      • Clone of Bigtable (Google)
      • Implemented in Java (clients: Java, C++, Ruby...)
      • Data is stored column-oriented
      • Distributed over many servers
      • Tolerant of machine failure
      • Layered over HDFS
      • Strong consistency
      • It's not a relational database (no joins)
      • Sparse data: nulls are stored for free
      • Semi-structured or unstructured data
      • Data changes through time
      • Versioned data
      • Scalable: goal of billions of rows x millions of columns
      • Example table: row keys Enclosure1 and Enclosure2; column family Animal (columns Type, Size) and column family Repair (column Cost); cells by timestamp: 12: Zebra, Medium, 1000€; 11: Lion, Big; 13: Monkey, Small, 1500€
      • Cell addressing: (Table, Row_Key, Family, Column, Timestamp) = Cell (Value)
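The cell addressing rule, (Table, Row_Key, Family, Column, Timestamp) = Value, can be mimicked with a small in-memory class. This is a sketch of the data model only, not the HBase client API; `SparseTable` and its methods are hypothetical, and attaching the Zebra and Lion cells to row Enclosure1 is an assumption about the slide's example.

```python
class SparseTable:
    """Toy versioned table: only cells that were written take space."""

    def __init__(self):
        # (row_key, family, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row_key, family, column, timestamp, value):
        versions = self._cells.setdefault((row_key, family, column), {})
        versions[timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        versions = self._cells.get((row_key, family, column))
        if not versions:
            return None  # a missing cell is simply absent, nothing is stored
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions.get(timestamp)

table = SparseTable()
table.put("Enclosure1", "Animal", "Type", 12, "Zebra")
table.put("Enclosure1", "Animal", "Type", 11, "Lion")
table.put("Enclosure1", "Animal", "Size", 12, "Medium")
```

Reading without a timestamp returns the newest version, while passing one returns the value as of that write, which is what "versioned data" means here.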
    • HBASE
      • Table
        • Regions for scalability, defined by row range [start-key, end-key)
        • Store for efficiency, 1 per family
        • 1..n StoreFiles (HFile format on HDFS)
      • Everything is bytes
      • Rows are ordered sequentially by key
      • Special tables -ROOT- and .META. tell clients where to find user data
      • http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
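Because rows are sorted by key and each region covers a half-open range [start-key, end-key), locating the region for a row reduces to a binary search over the region start keys. The sketch below illustrates that idea only; the boundaries and region names are hypothetical, and the real lookup goes through the -ROOT- and .META. tables.

```python
from bisect import bisect_right

# Hypothetical region layout: sorted start keys, each region covering
# [its start key, the next region's start key). Keys are raw bytes.
REGION_START_KEYS = [b"", b"g", b"p"]
REGION_NAMES = ["region-1", "region-2", "region-3"]

def find_region(row_key: bytes) -> str:
    """Return the region whose [start, end) range contains row_key."""
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]
```

For example, `find_region(b"g")` lands in the second region because start keys are inclusive and end keys exclusive.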
    • HADOOP INFRASTRUCTURE
      • Network switches
      • 2 x Apps Server: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K RAID1
      • 2 x NameNode/BackupNode/Admin: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K RAID1
      • 3 to n x DataNode: 2 CPU 6 core, 48 GB RAM, 12 x HDD
    • MOGILEFS OVERVIEW
      • A scalable, fault-tolerant, high-performance distributed file system
      • Asynchronous replication
      • No single point of failure
      • Automatic file replication (3 replications recommended)
      • Better than RAID
      • Flat namespace
      • Share-nothing
      • No RAID required
      • Local filesystem agnostic
      • Tracker (mogilefsd), client transfer: replication, deletion, query, reaper, monitor
      • DBNode: MySQL stores the MogileFS metadata (the namespace, and which files are where)
      • Storage Node (mogstored): HTTP and WebDAV server; files are broken up and spread over the storage nodes
      • Client library: Ruby, Perl, Java, Python, PHP…
      • Diagram: clients talking to trackers, DBNodes and storage nodes spread over Host1 to Host6
    • MOGILEFS ARCHITECTURE
      • Diagram: client library, trackers, database, storage nodes
    • MOGILEFS INFRASTRUCTURE
      • Network switches
      • 2 x Apps Server: 2 CPU 6 core, 48 GB RAM, 6 x HDD 600GB 15K RAID1
      • 2 x DB Node + 2 to n x Tracker: 2 CPU 6 core, 32 GB RAM, 6 x HDD 600GB 15K RAID1
      • 3 to n x Storage Node: 2 CPU 6 core, 32 GB RAM, 12 x HDD
    • GLUSTERFS OVERVIEW
      • A scalable, fault-tolerant, high-performance distributed and replicated file system
      • No single point of failure
      • Synchronous replication of volumes across storage servers
      • Asynchronous replication across geographically distributed clusters
      • Easily accessible usage quotas
      • No metadata server (fully distributed architecture, Elastic Hash)
      • Distributed / Distributed Replicated / Distributed Striped
      • POSIX compliant
      • FUSE (standard)
      • GlusterFS native, NFS, CIFS, HTTP, FTP, WebDAV, ZFS, EXT4…
      • No proprietary format to store files on disk
      • Namespace: the unified global namespace aggregates disk and memory resources into a single pool, virtualizing the underlying hardware
      • Data store: data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
      • Development: API, command line interface, Python, Ruby, PHP languages
      • Diagram: clients talking to GlusterFS servers on Host1 to Host6
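The "no metadata server" point rests on every client being able to compute where a file lives from its name alone. The sketch below shows that general idea with a plain hash-modulo rule; it is a deliberate simplification and not GlusterFS's actual Elastic Hash algorithm, and `pick_server` and the server names are hypothetical.

```python
import hashlib

def pick_server(path, servers):
    """Derive a file's storage server from its path, with no lookup service.

    Every client that knows the same server list computes the same answer.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

servers = ["host1", "host2", "host3", "host4"]
location = pick_server("/volumes/data/report.csv", servers)
```

A plain modulo remaps most files whenever the server list changes, which is one reason production systems use more elaborate hash-range schemes.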
    • GLUSTERFS ARCHITECTURE (diagram)
    • GLUSTERFS INFRASTRUCTURE
      • Network switches
      • 2 x Apps Server: 2 CPU 6 core, 48 GB RAM, 6 x HDD 600GB 15K RAID1
      • 2 x Backup Node / Admin: 2 CPU 6 core, 32 GB RAM, 6 x HDD 600GB 15K RAID1
      • 3 to n x GlusterFS Server: 2 CPU 6 core, 32 GB RAM, 12 x HDD
    • MOOSEFS OVERVIEW
      • A scalable, fault-tolerant, high-performance distributed and replicated file system
      • Spreads data over several physical servers, which are visible to the user as one resource
      • No single point of failure
      • Distribution of data across data servers via chunks
      • Maximum chunk size = 64MB
      • File duplication (1 to 3 copies, and more if necessary)
      • POSIX compliant
      • FUSE interface
      • No proprietary format to store files on disk
      • Master Server: a single machine managing the whole filesystem and storing metadata for every file (information on size, attributes and file locations, including all information about non-regular files, i.e. directories, sockets, pipes and devices). Metadata is stored in memory
      • Metalogger Server: any number of servers, all of which store metadata changelogs and periodically download the main metadata file, so that one of them can be promoted to the role of managing server when the primary master stops working
      • Data Server: any number of commodity servers storing file data and synchronizing it among themselves
      • Diagram: clients, Master Server, Metalogger Server and Data Servers spread over Host1 to Host6
    • MOOSEFS READ PROCESS
      1. Where is the data?
      2. The data is on x chunk servers
      3. Send me the data
      4. The data
      http://www.moosefs.org/
    • MOOSEFS WRITE PROCESS
      1. Where to write the data?
      2. Create new chunk on x chunk servers
      3. Success
      4. Write the data
      5. Synchronize the data
      6. Success
      7. Success
      8. Send write session end signal
      http://www.moosefs.org/
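The read and write exchanges above can be played out with two toy classes: a master that only tracks locations, and chunk servers that hold the bytes and synchronize among themselves. This is a sketch under simplifying assumptions (one chunk per file, two copies, no failures), and none of the class or function names come from MooseFS itself.

```python
class ChunkServer:
    def __init__(self):
        self.chunks = {}

    def write(self, name, data, replicas):
        self.chunks[name] = data          # write step 4: store the data
        for replica in replicas:          # write step 5: synchronize copies
            replica.chunks[name] = data

    def read(self, name):
        return self.chunks[name]          # read step 4: return the data

class Master:
    """Holds metadata only: which chunk servers store which file."""

    def __init__(self, chunk_servers, copies=2):
        self.chunk_servers = chunk_servers
        self.copies = copies
        self.locations = {}

    def allocate(self, name):
        # Write steps 1-3: choose servers for the new chunk and record them.
        chosen = self.chunk_servers[: self.copies]
        self.locations[name] = chosen
        return chosen

    def locate(self, name):
        # Read steps 1-2: tell the client where the data is.
        return self.locations[name]

def write_file(master, name, data):
    first, *others = master.allocate(name)
    first.write(name, data, others)

def read_file(master, name):
    return master.locate(name)[0].read(name)

servers = [ChunkServer() for _ in range(3)]
master = Master(servers)
write_file(master, "report", b"hello")
```

The client moves file data only to and from chunk servers; the master is consulted for placement, which keeps it off the data path.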
    • MOOSEFS INFRASTRUCTURE
      • Network switches
      • 2 x Apps Server: 2 CPU 6 core, 48 GB RAM, 6 x HDD 600GB 15K RAID1
      • 2 x Master / Metalogger / Admin Server: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K RAID1
      • 3 to n x Data Server: 2 CPU 6 core, 32 GB RAM, 12 x HDD
    • CASSANDRA OVERVIEW
      • Every node plays the same role
      • Highly available
      • Really fast reads, really fast writes
      • Flexible schemas
      • Distributed, replicated
      • No master, no slaves
      • No single point of failure
      • Client can talk to any node
      • Written in Java
      • Diagram layers: Cassandra API and tools; storage layer; partitioner and replicator; failure detector and cluster membership; messaging layer
    • CASSANDRA – COLUMN-ORIENTED
      • Column Family: think of it as a DB table
      • Column: a key-value pair (not just a value, like a DB column) with a timestamp, i.e. name, value, timestamp
      • SuperColumn: columns inside a column; the values are columns; no timestamp
      • Keyspace: like a namespace, generally 1 per app
      • Indexes
      • Queries
      • Diagram: a key pointing to a SuperColumn that contains columns
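The nesting keyspace > column family > row key > column, with each column carrying its own timestamp, can be sketched as nested dictionaries. The example below is an illustration of the data model and not the Cassandra API; the `Users` column family, the `user001` key and the `upsert` helper are hypothetical, and the newest-timestamp-wins rule is how column timestamps are used to pick between competing writes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int

def upsert(row, column):
    """Keep whichever version of a column has the newest timestamp."""
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column

# keyspace -> column family -> row key -> {column name: Column}
keyspace = {"Users": {}}
row = keyspace["Users"].setdefault("user001", {})

upsert(row, Column("email", "old@example.com", 1))
upsert(row, Column("email", "new@example.com", 2))
upsert(row, Column("email", "stale@example.com", 0))  # arrives late, ignored
```

Because the timestamp travels with every column, writes can reach any node in any order and still converge on the same final value.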
    • CASSANDRA INFRASTRUCTURE
      • Network switches
      • Cassandra nodes: 2 CPU 6 core, 32 GB RAM, 12 x HDD RAID0
    • MONGODB OVERVIEW
      • Document-oriented database with high performance, scalability and availability
      • Supports MapReduce
      • Shard: holds a portion of the total data. Reads and writes are automatically routed to the appropriate shard(s). Each shard is backed by a replica set, which holds only the data for that shard
      • Replica set: one or more servers, each holding copies of the same data. At any given time one is primary and the rest are secondaries. If the primary goes down, one of the secondaries takes over automatically as primary. All writes and consistent reads go to the primary, and all eventually consistent reads are distributed amongst the secondaries. A replica set is an asynchronous cluster replication technology
      • Config: multiple config servers, each of which holds a copy of the metadata indicating which data lives on which shard
      • Router: one or more routers, each of which acts as a server for one or more clients. Clients issue queries/updates to a router, and the router routes them to the appropriate shard while consulting the config servers
      • Client: one or more clients, each of which is (part of) the user's application and issues commands to a router via the mongo client library (driver) for its language
      • Diagram: clients; router servers (mongos); config servers (mongod); shard servers (mongod)
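The replica set behaviour described above (writes to the primary, asynchronous copies to secondaries, automatic takeover) can be sketched in a few lines. This is a toy model and not MongoDB's election protocol or driver API: `ReplicaSet` and its methods are hypothetical, replication is an explicit call, and the new primary is simply the first surviving member.

```python
class ReplicaSet:
    def __init__(self, names):
        self.members = {name: {} for name in names}  # member -> its data
        self.primary = names[0]

    def write(self, key, value):
        """All writes go to the primary."""
        self.members[self.primary][key] = value

    def replicate(self):
        """Copy the primary's data to every secondary (asynchronous in reality)."""
        data = self.members[self.primary]
        for name, store in self.members.items():
            if name != self.primary:
                store.update(data)

    def read(self, key, member=None):
        """Consistent reads use the primary; reads from a secondary may lag."""
        return self.members[member or self.primary].get(key)

    def fail_primary(self):
        """Drop the primary and promote a surviving secondary."""
        del self.members[self.primary]
        self.primary = next(iter(self.members))

rs = ReplicaSet(["a", "b", "c"])
rs.write("k", 1)
stale = rs.read("k", member="b")  # not replicated yet: eventually consistent
rs.replicate()
rs.fail_primary()
```

The `stale` read shows why reads from secondaries are only eventually consistent: the write exists on the primary before it reaches the other members.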
    • MONGODB DEPLOYMENT
      • Diagram: the app talks to routers (mongos); config servers run mongod; each shard is a replica set of mongod processes with one primary and several secondaries
    • MONGODB INFRASTRUCTURE
      • Network switches
      • 1 to n Router servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K RAID10
      • 1 to n Config servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K RAID10
      • 1 to n Shard servers: 2 CPU 6 core, 48 GB RAM, 12 x HDD 1TB 7.2K
    • COUCHDB OVERVIEW
      • Open-source distributed database
      • RESTful API
      • Schema-less document store (documents in JSON format)
      • Multi-Version Concurrency Control model
      • User-defined queries structured as map/reduce
      • Incremental index update mechanism
      • Multi-master replication model
      • Written in Erlang
      • Supports MapReduce
      • Easy-to-use data storage
      • Easy to integrate with web applications: JavaScript, JSON
      • Scalability for large web applications: incremental replication, bi-directional conflict detection and management
      • Queryable and indexable
      • Offline by default
      • Replication: master → slave replication; master ↔ master replication; filtered replication; incremental and bi-directional replication; conflict management
      • Diagram: clients talking to CouchDB master servers, each with slave servers
    • COUCHDB FUNCTIONALITIES
      • Document storage: a CouchDB server hosts named databases, which store documents
      • ACID properties: CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state
      • Compaction: on schedule, or when the database file exceeds a certain amount of wasted space, the compaction process clones all the active data to a new file and then discards the old file
      • Views (model, function, index)
        • View model: views are the method of aggregating and reporting on the documents in a database, and are built on demand to aggregate, join and report on database documents
        • View function: takes a CouchDB document as an argument and then does whatever computation it needs to determine the data that is to be made available through the view, if any. It can add multiple rows to the view based on a single document, or it can add no rows at all
        • View index: a dynamic representation of the actual document contents of a database. CouchDB makes it easy to create useful views of data, but generating a view of a database with hundreds of thousands or millions of documents consumes time and resources, so it is not something the system should do from scratch each time
      • Security: to control who can read and update documents, CouchDB has a simple reader access and update validation model that can be extended to implement custom security models
      • Distributed update and replication: CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected, and then bi-directionally replicate those changes later
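The view function and view index described above can be sketched as follows. CouchDB view functions are normally written in JavaScript; this Python version only illustrates the mechanism, and `by_tag`, `ViewIndex` and the `tags` field are hypothetical. The index recomputes rows for a single changed document instead of rebuilding everything, which is the point of the incremental update.

```python
def by_tag(doc):
    """View function: emit zero or more (key, value) rows for one document."""
    for tag in doc.get("tags", []):
        yield tag, doc["_id"]

class ViewIndex:
    def __init__(self, view_fn):
        self.view_fn = view_fn
        self.rows_by_doc = {}  # document id -> rows emitted for it

    def update(self, doc):
        """Incremental update: only this document's rows are recomputed."""
        self.rows_by_doc[doc["_id"]] = list(self.view_fn(doc))

    def query(self, key):
        return sorted(
            value
            for rows in self.rows_by_doc.values()
            for row_key, value in rows
            if row_key == key
        )

index = ViewIndex(by_tag)
index.update({"_id": "a", "tags": ["x", "y"]})
index.update({"_id": "b"})               # emits no rows at all
index.update({"_id": "c", "tags": ["x"]})
first = index.query("x")
index.update({"_id": "a", "tags": []})   # document changed: rows replaced
second = index.query("x")
```

Here `first` contains both matching documents, and `second` reflects the edit to document `a` without touching the rows of `b` or `c`.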
    • COUCHDB INFRASTRUCTURE
      • Network switches
      • 1 to n Router servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K RAID10
      • 1 to n Master servers: 2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K RAID10
      • 1 to n Slave servers: 2 CPU 6 core, 48 GB RAM, 12 x HDD 1TB 7.2K
    • THANK YOU