Big Data - SysFera presentation at the CSCI


Big Data is on everyone's lips, but what are the available technical solutions to deal with it? We give a brief overview of several solutions: distributed filesystems, NoSQL databases, and end-to-end solutions that take into account computations.

Published in: Technology, Business

  1. Big Data Technologies
     Benjamin Depardon, SysFera, 29.03.12
  2. SysFera
     • 2001: Research project from the Graal team (Inria/ENS)
       – DIET: grid middleware
     • 2007: SysFera-DS used within the Décrypthon project
       – Used in production 24/7/365 since then
       – Selected by IBM to replace Univa-UD
     • 2010: Creation of SysFera, an INRIA spin-off
     • 2012: A team of 14 (R&D: 4 engineers and 5 PhDs)
       – Supported by two experts from INRIA and ENS
       – SysFera-DS
  3. What is Big Data?
     • All kinds of data
     • Valuable insight, but difficult to extract
     • Several dimensions
       – Variety
         • Structured/unstructured
         • Text, audio, video…
       – Velocity
         • Time sensitivity
         • Streaming
       – Volume
         • Large files
         • Small files in large quantities
       – Variability
         • Different meanings/formats over different time periods
  4. What can you do with Big Data?
     • Analyze a Variety of Information
       – Social media/sentiment analysis
       – Geospatial analysis
       – Brand strategy
       – Scientific research
       – Epidemic early warning system
       – Market analysis
       – Video analysis
       – Audio analysis
     • Analyze Information in Motion
       – Smart Grid management
       – Multimodal surveillance
       – Real-time promotions
       – Cyber security
       – ICU monitoring
       – Options trading
       – Click-stream analysis
       – CDR processing
       – IT log analysis
       – RFID tracking & analysis
     • Analyze Extreme Volumes of Information
       – Transaction analysis to create insight-based product/service offerings
       – Fraud modeling & detection
       – Risk modeling & management
       – Social media/sentiment analysis
       – Environmental analysis
     • Discovery & Experimentation
       – Sentiment analysis
       – Brand strategy
       – Scientific research
       – Ad-hoc analysis
       – Model development
       – Hypothesis testing
       – Transaction analysis to create insight-based product/service offerings
     • Manage and Plan
       – Operational analytics – BI reporting
       – Planning and forecasting analysis
       – Predictive analysis
       – …
  5. What can you do with Big Data?
     • Financial Services
       – Fraud detection
       – Risk management
       – 360° View of the Customer
     • Utilities
       – Weather impact analysis on power generation
       – Transmission monitoring
       – Smart grid management
     • Transportation
       – Weather and traffic impact on logistics and fuel consumption
     • IT
       – Transition log analysis for multiple transactional systems
       – Cybersecurity
     • Health & Life Sciences
       – Epidemic early warning system
       – ICU monitoring
       – Remote healthcare monitoring
     • Retail
       – 360° View of the Customer
       – Click-stream analysis
       – Real-time promotions
     • Telecommunications
       – CDR processing
       – Churn prediction
       – Geomapping / marketing
       – Network monitoring
     • Law Enforcement
       – Real-time multimodal surveillance
       – Situational awareness
       – Cyber security detection
  6. What do you need?
     • Hardware
       – Storage capacity
       – Computing power
     • Software
       – Storage
         • Filesystems
         • Databases
       – Computation framework
  7. DISTRIBUTED FILESYSTEMS
  8. HDFS
     • Hadoop Distributed File System
     • Open source (Apache)
     • Design
       – High throughput instead of low latency
       – Large data sets (large files), data locality
       – Fault tolerance (replication)
       – Write once, read many (WORM)
       – Userspace
     • Limitations
       – Write-once model
       – Cannot be mounted by existing OS
       – No quotas/access permissions
       – Name node is a single point of failure
     • Used by Yahoo, Twitter, Rackspace, LinkedIn, Facebook…
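The "large files" and "fault tolerance (replication)" points of the HDFS slide can be illustrated with a toy placement model. This is only a sketch: the round-robin placement, the function names, and the block size and replication factor used in the example are assumptions made for illustration, not the placement policy HDFS actually applies.

```python
import math


def place_blocks(file_size, block_size, nodes, replication):
    """Split a file into fixed-size blocks and give each block
    `replication` distinct nodes (simple round-robin, illustrative only)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    n_blocks = math.ceil(file_size / block_size)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable if every block keeps at least one live replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical example: a 300 MB file, 128 MB blocks, 4 nodes, 3 replicas.
placement = place_blocks(300, 128, ["n1", "n2", "n3", "n4"], replication=3)
```

With three replicas per block, the file in this example survives the loss of any two nodes, but not of three nodes that happen to hold all copies of one block.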
  9. GlusterFS
     • Open source (GPLv3) NAS file system
     • Runs in userspace
     • File-based distributed mirroring, replication, striping, load balancing
     • FUSE, POSIX compliant
     • Storage quotas
     • No metadata server (fully distributed architecture, elastic hash)
     • Unified global namespace: aggregation of disk and memory in a single pool
     • Data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
     • Multiprotocol client support: GlusterFS native, NFS, CIFS, HTTP, WebDAV, FTP
     • Real-time self-healing
     • VM live replication
  10. Lustre
     • Open source (GPL)
     • Object based: separate metadata and file data
       – Metadata Server (MDS) nodes
       – Object Storage Server (OSS) nodes
     • Consistency: Lustre distributed lock manager (MDS and OSS)
     • Performance:
       – Data can be striped
       – MDT is only involved in pathname and permission checks, and is not involved in any file I/O operations
     • POSIX interface
     • Lustre Network (LNET): InfiniBand, TCP/IP, Myrinet…
     • Targeted at managing large files
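The "data can be striped" point of the Lustre slide comes down to a small piece of arithmetic: a byte offset in the file maps to one storage target and to an offset inside the object held there. The sketch below assumes a plain round-robin (RAID-0 style) layout; the function name and the parameters `stripe_size` and `stripe_count` are chosen for illustration and are not a Lustre API.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a striped file to
    (index of the storage target, offset inside that target's object)."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0, sizes and counts must be > 0")
    chunk = offset // stripe_size                 # which stripe-sized chunk of the file
    target = chunk % stripe_count                 # chunks rotate over the targets
    full_rounds = chunk // stripe_count           # chunks already stored on this target
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return target, object_offset


MIB = 1024 * 1024
# Hypothetical layout: 1 MiB stripes over 4 targets.
first = locate(0, MIB, 4)
fifth_chunk = locate(4 * MIB + 10, MIB, 4)
```

Because consecutive chunks land on different targets, a large sequential read can be served by several storage servers in parallel, which is where the performance gain comes from.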
  11. DATABASES
  12. CAP theorem (Brewer’s theorem)
     It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
     • Consistency
     • Availability
     • Partition tolerance
  13. NoSQL
     • Relaxes the ACID constraints
     • 4 types of NoSQL databases
       – Key-value (Memcached, Voldemort): data agnostic
       – Document oriented (CouchDB, MongoDB): data conscious
       – Column oriented (Bigtable, HBase, Cassandra)
       – Graph (Neo4j)
     • Requires more work on the client side
  14. Memcached
     • Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load
     • Simple key/value store
     • Smarts half in client, half in server
     • Servers are disconnected from each other
     • O(1) everything
     • Forgetting data is a feature
     • Used by LiveJournal, Flickr, Wikipedia, YouTube…
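Three points of the Memcached slide fit together in a few lines: the client decides which server holds a key ("smarts half in client"), servers never talk to each other, and a full server simply forgets old entries. The sketch below is a toy in-process model, not the memcached protocol or any client library; the class names, the CRC32-modulo server choice, and the entry-count capacity are assumptions made for illustration.

```python
import zlib
from collections import OrderedDict


class CacheServer:
    """One independent cache node holding at most `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)          # mark as most recently used
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # forget the least recently used entry

    def get(self, key):
        if key not in self.items:
            return None                      # a miss: the caller falls back to the database
        self.items.move_to_end(key)
        return self.items[key]


class CacheClient:
    """The client picks the server; servers know nothing about each other."""

    def __init__(self, servers):
        self.servers = servers

    def _server_for(self, key):
        return self.servers[zlib.crc32(key.encode("utf-8")) % len(self.servers)]

    def set(self, key, value):
        self._server_for(key).set(key, value)

    def get(self, key):
        return self._server_for(key).get(key)
```

Both operations are dictionary accesses, which matches the "O(1) everything" point. A miss is not an error: the application recomputes the value and stores it again.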
  15. MongoDB
     • Document oriented
     • Transport and storage: BSON format (derived from JSON, but binary)
     • Queries
       – No join
       – Map/reduce
     • Database contains collections
     • Collections contain documents
     • Master-slave replication
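The "no join" and "map/reduce" points of the MongoDB slide can be illustrated without a database. In the sketch below a collection is a plain Python list of dictionaries, related data is embedded in each document instead of being joined, and an aggregate is computed with a map step and a reduce step. The `map_reduce` helper and the sample `orders` collection are hypothetical and do not use or mimic the MongoDB driver API.

```python
from collections import defaultdict


def map_reduce(collection, map_fn, reduce_fn):
    """Apply map_fn to every document, group emitted values by key,
    then reduce each group to a single value."""
    groups = defaultdict(list)
    for document in collection:
        for key, value in map_fn(document):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Order lines are embedded in each order document, so no join is needed.
orders = [
    {"customer": "alice", "items": [{"sku": "a1", "price": 10}, {"sku": "b2", "price": 5}]},
    {"customer": "bob", "items": [{"sku": "a1", "price": 10}]},
    {"customer": "alice", "items": [{"sku": "c3", "price": 7}]},
]


def map_order(document):
    # Emit one (customer, order total) pair per order document.
    yield document["customer"], sum(item["price"] for item in document["items"])


def reduce_totals(customer, order_totals):
    return sum(order_totals)


totals = map_reduce(orders, map_order, reduce_totals)
```

Here `totals` maps each customer to the sum of their orders.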
  16. Cassandra
     • Column oriented (inspired by Bigtable & Dynamo)
     • Notion of super-columns
       – (Sorted) associative array of columns
     • Range queries on keys
     • Low latency: sequential access to disk
     • O(1) DHT
     • Eventual consistency
     • Values limited to 2 GB
     • RPC with Thrift
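The "O(1) DHT" point of the Cassandra slide refers to the fact that the node owning a key can be computed directly from the key, with no multi-hop routing. A minimal consistent-hashing ring shows the idea. This is a sketch under stated assumptions: the `HashRing` class, the SHA-256 based tokens, and the "next nodes clockwise" replica rule are illustrative choices, not Cassandra's partitioner or replication strategy.

```python
import bisect
import hashlib


class HashRing:
    """Nodes and keys are hashed onto the same ring; a key belongs to
    the first node found clockwise from the key's position."""

    def __init__(self, nodes):
        self._ring = sorted((self._token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def replicas(self, key, n=1):
        """Return up to n distinct nodes for key, walking clockwise."""
        start = bisect.bisect_right(self._tokens, self._token(key)) % len(self._ring)
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"])
owners = ring.replicas("user:42", n=2)
```

Any client or node that knows the ring computes the same owners locally (a binary search in this sketch), and adding a node only moves the keys that fall between it and its predecessor.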
  17. Neo4j
     • Graph oriented
     • Fully ACID transactions
     • Data is stored as a graph/network
       – Nodes and relationships with properties
       – "Property graph" or "edge-labeled multidigraph"
     • Queries
       – Indexing of nodes and properties
       – Graph traversal
     • Disk-based, native storage
     • Java, REST API
     • Master-slave load balancing
     • Use case: social network
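The property-graph model and the traversal queries of the Neo4j slide can be sketched in memory for the social-network use case. The `PropertyGraph` class below is a hypothetical teaching model, not the Neo4j Java or REST API: nodes and relationships carry properties, and a traversal follows one relationship type up to a given depth.

```python
from collections import deque


class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.edges = {}   # node id -> list of (relationship type, target id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_relationship(self, source, rel_type, target, **properties):
        self.edges[source].append((rel_type, target, properties))

    def traverse(self, start, rel_type, max_depth):
        """Breadth-first traversal along one relationship type.
        Returns {reached node id: depth}, excluding the start node."""
        depth = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if depth[current] == max_depth:
                continue
            for kind, target, _ in self.edges[current]:
                if kind == rel_type and target not in depth:
                    depth[target] = depth[current] + 1
                    queue.append(target)
        del depth[start]
        return depth


graph = PropertyGraph()
for person in ("alice", "bob", "carol", "dave"):
    graph.add_node(person, kind="person")
graph.add_node("acme", kind="company")
graph.add_relationship("alice", "KNOWS", "bob", since=2010)
graph.add_relationship("bob", "KNOWS", "carol", since=2011)
graph.add_relationship("carol", "KNOWS", "dave", since=2012)
graph.add_relationship("alice", "WORKS_AT", "acme")

friends_of_friends = graph.traverse("alice", "KNOWS", max_depth=2)
```

The traversal reaches bob at depth 1 and carol at depth 2; dave is three hops away and acme is linked by a different relationship type, so neither is returned.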
  18. PaaS Databases
     • Different providers
       – Amazon: RDS, SimpleDB
       – Google: AppEngine (GQL)
       – Microsoft: SQL Azure
     • Different cost models
       – CPU hour
       – CPU hour + traffic
       – Monthly fee + CPU hour + traffic
     • All depend on the load (number of users)
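The three cost models of the PaaS slide differ only in which terms they include, which a single formula makes explicit. All rates and quantities below are hypothetical numbers picked for the arithmetic; they are not the prices of any provider.

```python
def monthly_cost(cpu_hours, traffic_gb, *, cpu_rate, traffic_rate=0.0, fixed_fee=0.0):
    """Monthly bill = fixed fee + CPU hours * rate + traffic * rate.
    Leaving traffic_rate or fixed_fee at 0 gives the simpler models."""
    return fixed_fee + cpu_hours * cpu_rate + traffic_gb * traffic_rate


# Hypothetical month: 100 CPU hours and 50 GB of traffic.
cpu_only = monthly_cost(100, 50, cpu_rate=0.5)
cpu_and_traffic = monthly_cost(100, 50, cpu_rate=0.5, traffic_rate=0.25)
fee_cpu_traffic = monthly_cost(100, 50, cpu_rate=0.5, traffic_rate=0.25, fixed_fee=10.0)
```

Since CPU hours and traffic both grow with the number of users, every model ends up depending on the load; only the fixed fee does not scale.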
  19. SOLUTIONS
  20. GO-Transfer: Data transfer as SaaS
     • Reliable file transfer
       – Easy “fire-and-forget” transfers
       – Automatic fault recovery
       – High performance
       – Across multiple security domains
     • No IT required
       – Software as a Service (SaaS)
       – No client software installation
       – New features automatically available
       – Consolidated support & troubleshooting
       – Works with existing GridFTP servers
       – Globus Connect solves “last mile problem”
     GO-Transfer is the initial offering of the US National Science Foundation’s XSEDE User Access Services (XUAS)
     © Ian Foster
  21. Hadoop environment
     • PIG (Data Flow)
     • HIVE (Batch SQL)
     • SQOOP (Data Import)
     • ZOOKEEPER (Coordination)
     • AVRO (Serialization)
     • CHUKWA (Displaying, Monitoring, Analysing Logs)
     • MAP REDUCE (Job scheduling – Raw processing)
     • HBASE (Real Time Query)
     • HDFS (Hadoop Distributed File System – Unstructured Storage)
  22. IBM Big Data Platform
     • Hadoop
       – InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
     • Information Integration
       – InfoSphere Information Server: High volume data integration and transformation
     • Stream Computing
       – InfoSphere Streams: Low Latency Analytics for streaming data
     • MPP Data Warehouse
       – IBM InfoSphere Warehouse: Large volume structured data analytics
       – IBM Netezza High Capacity Appliance: Queryable Archive Structured Data
       – IBM Netezza 1000: BI+Ad Hoc Analytics Structured Data
       – IBM Smart Analytics System: Operational Analytics on Structured Data
       – IBM Informix Timeseries: Time-structured analytics
  23. SysFera-DS
  24. Dataflows
     • Iteration strategies
     • Automatic parallelism
     • Control structure (if/then/else, do/while)
     • Fault tolerant
     • Multi-workflow scheduling
     [Figure: workflow diagram with nodes labelled GRAFIC2, RAMSES (MPI), HALOMAKER, TREEMAKER, GALAXYMAKER and MOMAF, annotated "n snapshots", "x tree files", "Mock catalogues" and "Parameter sweep"]
  25. DAGDA
     • Meta data-manager
     • Data management from end to end
     • Data replication
       – Explicit
       – Implicit
     • Data persistency
     • Memory and disk quotas
     • Replacement algorithms (LRU, LFU, FIFO)
     • Best source selection
     • Strong link with task manager
     • Pluggable policies, local data managers
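The three replacement algorithms named on the DAGDA slide differ only in which piece of bookkeeping decides what is evicted when a quota is reached. The sketch below is a generic illustration of LRU, LFU and FIFO; the `Entry` record and the `pick_victim` function are hypothetical and are not taken from DAGDA's code or API.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    inserted_at: int    # logical time the data item was stored
    last_used_at: int   # logical time of the most recent access
    use_count: int      # number of accesses so far


def pick_victim(entries, policy):
    """Return the name of the entry to evict under the given policy."""
    if not entries:
        raise ValueError("nothing to evict")
    if policy == "FIFO":
        key = lambda entry: entry.inserted_at    # oldest insertion goes first
    elif policy == "LRU":
        key = lambda entry: entry.last_used_at   # least recently used goes first
    elif policy == "LFU":
        key = lambda entry: entry.use_count      # least frequently used goes first
    else:
        raise ValueError(f"unknown policy: {policy}")
    return min(entries, key=key).name


stored = [
    Entry("a", inserted_at=1, last_used_at=9, use_count=5),
    Entry("b", inserted_at=2, last_used_at=3, use_count=7),
    Entry("c", inserted_at=3, last_used_at=8, use_count=1),
]
```

On this sample, FIFO evicts "a" (stored first), LRU evicts "b" (idle the longest) and LFU evicts "c" (used least often), which shows why the policy is worth making pluggable.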
  26. Thank you! Questions?
     Benjamin.Depardon@SysFera.com