Big data and you


Big Data for dummies : 20 slides to grab the most essential messages to understand big data and become eager to learn more

Published in: Data & Analytics
  1. Big Data and You (2015 May Edition)

     Objectives
     This document is designed to introduce Big Data and Analytics. Instead of being a deep-dive technical paper or a product-portfolio catalogue, it is a friendly educational presentation (easily and quickly read) for specialists, architects, PMs and managers (*). One simple goal (but a complex and time-consuming exercise): if you read this paper, you learn something, and then you will want more details on your way to becoming an expert. Yes, You can Big Data.

     Table of Contents
     1. Introduction  2. Definition  3. BI principles  4. Chronology  5. Hadoop I  6. Hadoop II  7. Hadoop Ecosystem  8. BI vs Big Data  9. Hadoop patterns  10. Hadoop Market  11. BD&A vendors  12. Competition  13. In Memory  14. Streams  15. BigInsights  16. Architecture  17. Positioning  18. Why Power?  19. Contacts  20. New!

     Introduction
     2012 was the big data marketing buzz, 2013 the big data technical enablement, 2014 the year of big data projects. Now European customers are massively deploying big data (and still analytics) projects. It is time to become an expert, to guide our customers and to talk with the Big Data ecosystem to fill the Big Data skills gap.

     (*) This paper does not pretend to be exhaustive on the Big Data subject, nor is it intended to recommend a precise and specific architecture for architects, detailed performance tuning for specialists, or a marketing campaign. It assumes little or no prior knowledge of Big Data.

     Author: Christophe.menichetti@fr.ibm.com
  2. Introduction: Definition (IBM Montpellier Client Center, Christophe.menichetti@fr.ibm.com)

     What is Data Analysis? Why analyse data?
     Analysis of data is a process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names in different business, science, and social science domains, such as:
     • Business Intelligence / Analytics
     • Data Mining / predictive tools
     • Big Data
     • Data integration / data visualisation
     • and so on...

     IT technologies and computer science keep evolving. Yesterday, when IBM, Honeywell, Sperry, ICL, Xerox, Digital or Olivetti were the IT leaders, CPU and memory were the key differentiators. Today, when IBM, Google, SAP and Oracle are the IT leaders, the ultimate differentiator is being able to make more informed choices with confidence, to anticipate and shape business outcomes. As company and industry leaders, you absolutely need deeper insight from your information to beat your competitors:
     • Which customers are thinking of leaving?
     • Which transactions are fraudulent?
     • Detect life-threatening conditions in time to intervene

     Let's make it simpler, an example. Analytics = transforming data into (sexy) information to make (intelligent) decisions. Weather forecast: you must decide which boots to take to Paris. You are no expert at all (temperature, pressure, cyclone = raw data), but you can decide based on a weather map (report/analysis).

     !message: Data is the new oil, requiring mining, refining and delivering.
  3. Definition

     What is Business Intelligence?
     Business analytics (BA) refers to the skills, technologies and practices for continuous iterative exploration and investigation of past business performance, to gain insight and drive business planning. Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence (BI) traditionally focuses on using a consistent set of metrics both to measure past performance and to guide business planning, also based on data and statistical methods.

     What is Big Data?
     Big Data is a broad term for data sets so large or complex that they are difficult (or too expensive) to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. Beyond the classic three Vs (Volume, Velocity, Variety), a 4th V (Value) and a 5th V (Veracity) are often added.

     !message: Big Data creates new opportunities to extend Analytics for higher value.

     For more information/technical details, feel free to contact us.
  4. OLTP versus OLAP: BI reference architecture

     Reporting solutions display data in either a synthesized or a detailed view, easy for the end user to understand (data mining: discovering interesting/useful patterns and relationships in large volumes of data; analyzing the past to predict the future).

     Data warehouse: a central database in which data is stored and can be restructured to answer business needs.

     ETL: unifies data from heterogeneous data sources (extracting the useful data) and consolidates it into a single destination database (cleansing and modifying the data according to the desired output).

     Good to know! People very often equate BI with the reporting/data-mining tool, because this is the visible part of the iceberg. But this is a misnomer: BI refers to the full set of tools, including reporting, the data warehouse and ETL. For your information, ~70% of the costs and efforts in BI projects go into the data warehouse, the most important (but hidden) part of the iceberg.

     Star schema: optimized for SQL read requests. The fact table (metrics of the reports) sits in the middle, surrounded by dimension tables (the analysis axes) = On-Line Analytical Processing (OLAP).

     3NF schema: optimized for flexibility and storage-space savings = On-Line Transactional Processing (OLTP).

     !message: BI/Analytics is the way to transform raw data into information and decisions.

     Any Analytics projects/questions? Do not hesitate to contact us.
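The star-schema layout above can be sketched in a few lines of Python. Table contents, keys and column names are invented for illustration: a central fact table holds the metrics, dimension tables hold the analysis axes, and an OLAP-style query joins and rolls them up.

```python
# Minimal star-schema sketch: a fact table (sales metrics) surrounded by
# dimension tables, plus an OLAP-style roll-up of revenue by region.
# All table and column names here are illustrative, not from any product.

dim_store = {1: {"region": "North"}, 2: {"region": "South"}}
dim_product = {10: {"category": "Boots"}, 11: {"category": "Hats"}}

fact_sales = [  # each fact row references the dimensions by key
    {"store_id": 1, "product_id": 10, "revenue": 120.0},
    {"store_id": 1, "product_id": 11, "revenue": 30.0},
    {"store_id": 2, "product_id": 10, "revenue": 75.0},
]

def revenue_by_region(facts, stores):
    """Join fact rows to the store dimension and aggregate (a tiny OLAP query)."""
    totals = {}
    for row in facts:
        region = stores[row["store_id"]]["region"]
        totals[region] = totals.get(region, 0.0) + row["revenue"]
    return totals

totals = revenue_by_region(fact_sales, dim_store)
print(totals)  # {'North': 150.0, 'South': 75.0}
```

A 3NF/OLTP design would instead normalize for writes; the star shape keeps read queries to a single join fan-out, which is why it suits reporting.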
  5. A little bit of history?

     First steps, the 1950s: IBM journal article "A Business Intelligence System" (Hans Peter Luhn, 1958); birth of the wording "Business Intelligence"; first tools for automatic methods, providing alert services (for scientists).

     1970: first MIS solutions (Management Information System); static, non-flexible, no analysis features.

     1980: first EIS software (Executive Information System); a more sophisticated MIS: simulations, reports, forecasts.

     1990: the BI concept is officially formalized by Howard Dresner, Gartner Group analyst; birth of Business Performance Management (BPM/EPM).

     2005-2010: strong BI market consolidation, with major IT acquisitions. Oracle acquired Siebel (reporting, $6B), Hyperion (EPM, $4B) and Sunopsis (ETL, $1B). SAP acquired Business Objects (reporting, $7B), Sybase (DW, $6B) and Fuzi (ETL). IBM bought Cognos (reporting, $5B), Netezza (DW, $2B) and Ascential (ETL, $1B). Meanwhile, Yahoo and Google faced terrible performance issues with DW architectures; the need to rethink the data-analysis approach led to the birth of Hadoop.

     2012 and beyond: birth of Big Data.

     !message: Analytics has evolved from business initiative to business imperative.
  6. Why Hadoop?

     1. Performance issue. Consider that over the past decade:
     - CPU speed has increased 8 to 10 times
     - DRAM speed has increased 7 to 9 times
     - Network speed has increased 100 times
     - Bus speed has increased 8 to 10 times
     - Hard disk drive speed has increased ONLY 1.2 times

     2. NoSQL (Not Only SQL): a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, finer control over availability and, most importantly, COST.

     !message: Hadoop meets the need for new scalable architectures, providing business efficiency and flexibility beyond the existing relational data model.

     Would you like to bench/test? Go to the MOP Client Center.
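A quick back-of-the-envelope calculation shows why that last number matters. Assuming a commodity drive scans roughly 100 MB/s (an illustrative figure, not a benchmark), reading a terabyte from one disk takes hours, while striping it across many disks read in parallel, as Hadoop does, brings it down to minutes:

```python
# Arithmetic behind the slide's point: disks, not CPUs, are the bottleneck,
# so Hadoop reads in parallel from many cheap disks.
# The 100 MB/s single-drive throughput is an assumed, illustrative figure.

TB = 10**12
DISK_THROUGHPUT = 100 * 10**6  # bytes/second for one commodity drive (assumed)

def scan_seconds(data_bytes, n_disks):
    """Time to scan a data set striped evenly across n_disks read in parallel."""
    return data_bytes / (DISK_THROUGHPUT * n_disks)

print(scan_seconds(1 * TB, 1) / 60)    # one disk: about 167 minutes
print(scan_seconds(1 * TB, 100) / 60)  # 100 disks in parallel: under 2 minutes
```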
  7. How does it work?

     Apache Hadoop is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers code (specifically JAR files) to the nodes that hold the required data, and the nodes then process it in parallel. This approach takes advantage of data locality, allowing the data to be processed faster and more efficiently via distributed processing than with a more conventional supercomputer architecture relying on a parallel file system where computation and data are connected via high-speed networking.

     Would you like to sound like an expert? HDFS default replication: 3x. HDFS default block size: 128 MB. HDFS sits on top of a native Linux filesystem (ext4, ext3). Slave nodes: HDFS (data node), MapReduce (task tracker). Master nodes: HDFS (name node), MapReduce (job tracker); the secondary name node periodically checkpoints the name node's metadata (despite its name, it is not a hot standby).

     !message: Volume and Variety challenges have led to the creation of new data processing: MapReduce and HDFS.

     Would you like a briefing? Go to the MOP Client Center.
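The MapReduce idea described above can be illustrated with a toy, single-machine word count in Python. The function names are ours; in real Hadoop the map and reduce phases run on the data nodes and the shuffle happens over the network.

```python
# A toy word count in the MapReduce style: map emits (word, 1) pairs,
# a shuffle groups them by key, and reduce sums each group.
# Real Hadoop ships the JAR to the data nodes; here everything is local.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data and you", "big data is big"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'and': 1, 'you': 1, 'is': 1}
```

The split into independent map tasks and independent reduce tasks is what lets the framework scale the same program from one node to thousands.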
  8. YARN, "the Hadoop 2", decouples MapReduce's resource-management and scheduling capabilities, enabling Hadoop to support more varied processing approaches and applications (interactive SQL, real-time streaming, batch processing).

     Around the core, an ecosystem has grown:
     • Flume was created to let you flow data from a source into your Hadoop environment.
     • ZooKeeper provides a centralized infrastructure and services that enable synchronization across a cluster; it maintains common objects needed in large cluster environments, such as configuration information and a hierarchical naming space.
     • HBase is a column-oriented database management system that runs on top of HDFS. It is well suited to sparse data sets, which are common in many big data use cases.
     • Hive, developed at Facebook, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements.
     • Oozie simplifies workflow and coordination between jobs; it lets users define actions and the dependencies between them.
     • Pig, initially developed at Yahoo!, lets people focus on analyzing large data sets and spend less time writing mapper and reducer programs.
     • Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop.
     • Mahout takes the most popular data mining algorithms for clustering, regression testing and statistical modeling and implements them using the MapReduce model.
     • Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters.

     !message: The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache.
  9. Different approaches

     Don't get us wrong: there is no bad approach or good approach, and there is no magical approach. There are different approaches, for different needs and results. With the BI approach, business users determine what question to ask (a business hypothesis) and the IT team structures the data (specific selected data in a data warehouse) to answer that question. With the Big Data approach, IT delivers a platform holding all the data to enable creative discovery, and business users explore what questions could be asked.

     Different architectures. BI architecture: the application server and database server are separated, the network sits in the middle, and the data has to go through the network. Big Data architecture: the analysis program runs where the data is, so only the functions have to go through the network. This is highly scalable and flexible by design.

     Different objectives. Hadoop is one of the multiple facets of Big Data. This facet is designed to run huge (Volume) "read" batches, in an extremely cost-saving way, on unstructured data (Variety).

     !message: Do not compare apples and oranges: you (still) need both.

     For more information/technical details, feel free to contact us.
  10. Technical Hadoop patterns: Big Data business use cases

     • Big Data Exploration: find, visualize and understand all big data to improve decision making.
     • Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources.
     • Operations Analysis: analyze a variety of machine data for improved business results.
     • Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency.
     • Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time.

     Keep in mind: the term Big Data is a bit of a misnomer. Big data does not refer only to huge volumes of data or to Hadoop; there are many other patterns using streams or in-memory solutions.

     !message: Big Data Analytics is applied across all industries, with different use cases.
  11. Hadoop has been most rapidly adopted by the government, banking, finance, IT and ITES, and insurance sectors. Geographical analysis of the market suggests that North America is the leading revenue-generating market and will remain so until 2020. Hadoop hardware-based solution providers have been the biggest recipients of venture-capital funding. Recent times have witnessed a steep demand for real-time, operational analytics. (Hortonworks study, 2014; Wikibon figures, 2013)

     !message: In the 1990s, new performing hardware was the differentiator for companies to compete. Nowadays big data is the key competitive differentiator.
  12. The market for Big Data & Analytics solutions has exploded. The race is hot and complex:
     • Every vendor is jumping in
     • Alternatives from everywhere
     • Startups proliferate
     • Partnerships

     No other vendor has what IBM has: Software/Hardware, Services/Research, Cloud, Mobile, Social. Yet just having 'everything' does not make a market leader.

     Based primarily on the 2012 Wikibon report/forecast: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

     !message: The race is hot, every vendor is jumping in, alternatives come from everywhere, startups proliferate. How do we differentiate in such a crowded market?

     Any competitive big data questions? Feel free to contact us.
  13. Four major distributions of Hadoop have spawned ecosystems of partners developing data management and analytic solutions for Big Data.

     !message: IBM is a global Big Data and Analytics leader, with the industry's most comprehensive and enterprise-class solutions and the broadest portfolio.

     Any competitive big data questions? Feel free to contact us.
  14. In-Memory: good timing for an old idea

     Largely driven by the big data phenomenon, in-memory computing is a powerful, transformative IT trend meeting high-performance analytics expectations and data-visualization needs. An in-memory solution should not be confused with a conventional DBMS that stores data in disk blocks cached in memory. With an in-memory database, all information is initially loaded into memory, which eliminates the need for optimized databases, indexes, aggregates, and the design of cubes and star schemas.

     "In-memory" database technology has been around for over a decade. Traditionally it was used in a limited number of operational application workloads (FSS trading, telco billing, HPC, embedded devices), but 2011 saw an inflection point: increased focus and 'push' by SAP.

     Column-based technology. The arrival of column-centric databases, which store similar information together, allowed data to be stored more efficiently, with greater compression and faster read access, reducing the amount of memory needed to perform a query and increasing processing speed. That is why column-based technology is very often associated with in-memory technology.

     Volume: as users and data increase, the RAM needed also increases = hardware costs. Velocity: real-time analytics, operational analytics.

     !message: Big Data analytics can benefit from these very large in-memory systems for velocity (since memory has become cheaper).

     Do you need a Big Data Analytics briefing? Come to us in MOP.
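The link between column storage and compression noted above can be shown with run-length encoding, one of the simplest schemes a column store can use. The column contents below are invented for illustration:

```python
# Why column stores pair well with in-memory analytics: storing similar
# values together lets simple schemes like run-length encoding shrink a
# column, so more data fits in RAM and scans touch less of it.

def rle_encode(column):
    """Run-length encode a column of values as [(value, run_length), ...]."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

# A "country" column, stored value after value, compresses well once sorted:
country_column = ["FR"] * 4 + ["DE"] * 3 + ["US"] * 2
print(rle_encode(country_column))  # [('FR', 4), ('DE', 3), ('US', 2)]
```

Nine stored values collapse to three runs; a row store interleaving countries with other columns could not exploit that repetition.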
  15. Streams

     • Deal with terabytes of data each second
     • Work with application, sensor and internet data, video/audio
     • Deliver insight in microseconds to analytical applications
     • Support complex scenarios using C++ or Java code

     Streams is tailor-made for companies who need to process data from non-traditional sources, with huge volumes of data, who need results very, very quickly, integrated with existing analytics investments.

     Stream computing is a different paradigm. Traditionally, data is accessed using queries that pull the data from a storage device such as a data warehouse or database, which is still valid for many requirements. The new stream-computing paradigm brings the data to the query: data is pushed, or flows, through the analytics. This is required for many new use cases in big data.

     Here is a little more on how Streams works and what you can do with it. Each square in the diagram represents an operator. The data passes (input stream) through each operator, where some action is performed on it (output stream). You can fuse data from multiple streams, modify it, annotate it, perform an analytics operation on it, or classify it.

     !message: Velocity challenges have led to the creation of a new data computing paradigm and solution: streaming, bringing effective real time down to microseconds.

     Do you need a Big Data Analytics briefing? Come to us in MOP.
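The operator chain described above can be sketched with Python generators: data is pushed through a pipeline of operators instead of being pulled by a query. The operator names, the sensor values and the temperature threshold are all invented for illustration.

```python
# A stream pipeline as chained generators: each function is one "operator"
# that consumes an input stream and produces an output stream.

def parse(readings):                    # operator 1: annotate raw values
    for value in readings:
        yield {"celsius": value}

def filter_alerts(events, threshold):   # operator 2: classify / filter
    for event in events:
        if event["celsius"] > threshold:
            yield event

def enrich(events):                     # operator 3: modify / annotate in flight
    for event in events:
        event["alert"] = "overheat"
        yield event

sensor = [18.5, 21.0, 90.2, 19.9]       # a (fake) sensor feed
pipeline = enrich(filter_alerts(parse(sensor), threshold=80.0))
alerts = list(pipeline)
print(alerts)  # [{'celsius': 90.2, 'alert': 'overheat'}]
```

Because generators are lazy, each reading flows through the whole chain as it arrives, mirroring how a streaming engine processes events without first landing them in storage.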
  16. BigInsights

     Hadoop is an open-source implementation and, although very well maintained and doing the "job" for companies, it implies a risk. As with Linux, major IT companies provide Hadoop distributions. IBM took Hadoop and ruggedized it for enterprises, adding enterprise features such as performance, resilience and IBM experience (BigSheets, Big SQL, GPFS, ...) while remaining 100% on open standards. We call it BigInsights, running on x86, Power Systems and Mainframe (Linux). Two editions: Basic Edition (100% open source, free) and Enterprise Edition.

     BigSheets: a big data visualization capability that enables end users to collect, explore and uncover actionable insights through a commonly understood spreadsheet experience (drag and drop, clicks, without any Java or Hadoop skills).

     Adaptive MapReduce: an already proven product from Platform Computing (HPC acquisition), rewriting the MapReduce paradigm in C++ (no garbage collection, faster memory management), allowing:
     • optimized shuffle and map sort
     • separated resource management and job scheduling
     • shared memory across JVMs, eliminating data movement

     Big SQL: SQL on Hadoop is challenging (wide variety of data, MapReduce is batch oriented). Big SQL provides native, fully compliant SQL access to data stored in BigInsights, real JDBC/ODBC drivers, and optimization based on a Massively Parallel Processing (MPP) architecture, drawing on DB2 experience.

     Spectrum Scale: GPFS FPO (File Placement Optimizer), a scalable, high-performance, highly reliable product with 20+ years of experience, has many advantages over HDFS:
     • POSIX compliant
     • no single point of failure
     • multi-tenant
     • HA/DR solutions

     IBM BigInsights for Apache Hadoop v4 has just been released, based on the ODP initiative (the slide shows Version 3.0 Enterprise Edition).

     !message: IBM Hadoop strategy: better analytics tooling that is easier to use, plus commitment to Hadoop open source (ODP initiative).
  17. How are leading companies transforming their data and analytics environment to take advantage of Big Data and provide faster, better insights at reduced cost within their existing enterprise data warehouses?

     !message: The foundational schematic bringing analytics to all stages of the data lifecycle can be overlaid with the specific products that provide those functions.

     Need customer enablement? Education? Send us an email.
  18. Systems of Record: structured data from operational systems.
     Systems of Engagement: data that "connects" companies with their customers, partners and employees.
     Systems of Insight: diverse data types that combine structured and unstructured data for business insight (in-memory, Hadoop, EDW appliance).

     !message: Transformational benefit and business outcomes come from integrating new data sources with traditional corporate data to find new insights.

     Need an architecture workshop? Sizing? Send us an email.
  19. Important to keep in mind

     Big Data (BigInsights, Cognos, SPSS, ...) can run on IBM System z. Customers can take advantage of co-locating business data and OLAP data, managing high-speed transactions and complex queries for real-time operational analytics on a single integrated platform, and benefit from the performance, resiliency and quality of service of the IBM mainframe for critical businesses, as many banking/insurance customers do.

     Power advantages for Big Data
     • Hadoop is Linux; Linux is Power. Hadoop is cheap; Power is cheap. Hadoop ecosystem: PowerLinux market acceptance.
     • Linux on Power runs the same commands as Linux on x86, with versions released on the same date.
     • Linux on Power makes up 17.6% of the 500 most powerful Linux systems (with 5 in the top 10).
     • POWER8 increases the performance, reliability and availability lead over Intel, as an alternative to Intel.
     • The OpenPOWER Foundation brings rapid innovation to the Power platform for open Linux.
     • Little-endian support makes porting Linux-on-x86 applications even easier.
     • The POWER8 design point is big data (more threads, more cache, more bandwidth, CAPI, ...); the Intel design point is multiple markets (smartphones, tablets, desktop PCs, servers, ...).

     !message: The infrastructure is a foundational piece of IBM's perspective for delivering capabilities and offerings for BD&A.

     Feel free to contact the MOP PowerLinux center for more details.
  20. IBM Big Data resources: WW competency centers, Big Data Analytics links

     Web sites: ibm.com/Hadoop, Information Management Acceleration Zone, PowerLinux Big Data
     IBM communities: IBM Systems Big Data and Analytics, BDSC practitioner wiki, IBM Analytics, Global Big Data & Analytics Clients References
     IBM developerWorks: https://www.ibm.com/developerworks/analytics/
     http://bigdatauniversity.com/
     http://wikibon.org/wiki/v/Category:Big_Data
     http://en.wikipedia.org/wiki/Apache_Hadoop
     http://www.slideshare.net/search/slideshow?searchfrom=header&q=big+data

     Please, please help us improve this document: if you have any comments or ideas, feel free to send an email.

     [INFO] Based on three years of experience on big data projects, and after many weeks of intensive work compiling several presentations given to customers or at conferences and synthesizing the concepts, the objective of this educational paper is to clarify some of the concepts and solutions around Big Data, in order to better understand the related challenges and opportunities. There may still be (many) typing errors, mistakes, misleading words and missing concepts, so please be kind.

     If we cannot help you directly, we'll point you to the right person.
  21. NEW AND/OR HOT: OPEN DATA PLATFORM

     What is the Open Data Platform (ODP)?
     > An open-source, non-profit entity, focused on and committed to evolving the current state of the platform and delivering a Foundation-certified, packaged and tested reference distribution.

     Why the Open Data Platform?
     > The current ecosystem is challenged and slowed by fragmented and duplicated efforts. The ODP Core will take the guesswork out of the process and accelerate many use cases by running on a common platform, freeing enterprises and ecosystem vendors to focus on building business-driven applications.

     Where does ODP sit versus Apache?
     > ODP supports the Apache (ASF) mission. ASF provides a governance model around individual projects without looking at the ecosystem; ODP aims to provide a vendor-led, consistent packaging model for core Apache components as an ecosystem.

     Why is IBM involved in ODP?
     > Strong history of leadership in open source and standards: IBM has always been a believer in the standardization of interfaces to components of IT and application infrastructure (SQL, Eclipse, OpenPOWER, ...).
     > It supports our commitment to open-source currency in all future releases.
     > It accelerates IBM innovation within Hadoop and the surrounding applications.
     > We expect Hortonworks and Pivotal distribution adoption on PowerLinux.

     !message: ODP is clearly a major and strategic choice in the open community to accelerate Hadoop adoption and grow the BigInsights and PowerLinux ecosystem / ISVs.
  22. NEW AND/OR HOT: Big Data/Analytics and Cloud

     Customer data center (on-premises) versus cloud data center (off-premises), along a spectrum from CONTROL to SIMPLICITY: PureData for Analytics, DB2 BLU and InfoSphere BigInsights on premises; Cloudant and dashDB on SoftLayer.

     Cloudant: a distributed NoSQL "data layer" powering web, mobile and IoT since 2009. Available as a fully managed DBaaS, managed by you on-premises, or hybrid. A transactional JSON "document" database that spreads data across data centers and devices. Ideal for apps that require:
     > massive, elastic scalability
     > high availability
     > geo-location services
     > full-text search
     > occasionally connected users

     dashDB: data warehouse and analytics as a service on the cloud. dashDB keeps the data warehouse infrastructure out of your way, letting you benefit from:
     • next-generation in-memory technology
     • columnar storage
     • SIMD hardware acceleration
     • actionable compression
     • support for OLAP SQL extensions
     • connections to common third-party BI tools

     !message: IBM's fundamental cloud strategy: a complete cloud offering, a mix of control and simplicity.
  23. NEW AND/OR HOT: SPARK

     Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine-learning algorithms.

     Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3. Spark also supports a pseudo-distributed local mode, usually used only for development or testing, where distributed storage is not required and the local file system can be used instead; in this scenario, Spark runs on a single machine with one executor per CPU core.

     Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open-source projects.

     !message: Spark is positioned as a fast and general engine for Big Data. It generalizes the MapReduce model and is (possibly) poised to replace MapReduce.
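The "load once, query repeatedly" advantage described above can be illustrated with a pure-Python analogy (not Spark code): a MapReduce-style job re-reads its input on every pass, while a Spark-style job keeps the working set cached in memory. The class, the data and the pass count are all invented for illustration; read counts stand in for disk I/O.

```python
# Why in-memory caching helps iterative jobs: count how often each style
# has to go back to "disk" over ten passes of an iterative computation.

class Dataset:
    def __init__(self, rows):
        self.rows = rows
        self.disk_reads = 0   # how many times we touched "disk"
        self.cache = None

    def load(self, use_cache):
        if use_cache and self.cache is not None:
            return self.cache          # Spark-style: served from memory
        self.disk_reads += 1           # MapReduce-style: hit the disk again
        data = list(self.rows)
        if use_cache:
            self.cache = data
        return data

def iterate(dataset, passes, use_cache):
    total = 0
    for _ in range(passes):            # e.g. a machine-learning loop
        total = sum(dataset.load(use_cache))
    return total

mr = Dataset([1, 2, 3])
iterate(mr, passes=10, use_cache=False)     # re-reads input every pass
spark = Dataset([1, 2, 3])
iterate(spark, passes=10, use_cache=True)   # reads once, then serves from RAM
print(mr.disk_reads, spark.disk_reads)      # 10 reads versus 1 read
```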
  24. NEW AND/OR HOT: DATA LAKE ARCHITECTURE

     IDC stated in late 2014: "By 2017 unified data platform architecture will become the foundation of BDA strategy. The unification will occur across information management, analysis, and search technology."

     A data reservoir is a data lake that provides data to an organization for a variety of analytics processing, including:
     • discovery and exploration of data
     • simple ad hoc analytics
     • complex analysis for business decisions
     • reporting
     • real-time analytics

     It is possible to deploy analytics into the data reservoir to generate additional insight from the data loaded into it. A data reservoir manages shared repositories of information for analytical purposes. Each data reservoir repository is optimized for a particular type of processing: real-time analytics, deep analytics (such as data mining), exploratory analytics, OLAP, reporting, ...

     Example: creating a logical warehouse. Information virtualization hides the complexity of where the data is located. Different repositories host different workloads, but this complexity is hidden by the information virtualization layer.

     !message: From the application point of view, the data lake challenge is to be a unique and unified data repository, queryable like a black box.
