Big Data and Fast Data combined – is it possible? An introduction to Big Data architectures. Ulises Fasoli, Senior Consultant, Trivadis. Talk given at the Swiss Data Forum on 24 November 2015 in Lausanne.
The Synapse IoT Stack: Technology Trends in IoT and Big Data (InMobi Technology)
This is the presentation from Big Data November Bangalore Meetup 2014.
http://technology.inmobi.com/events/bigdata-meetup
Talk Outline:
- What does THE HIVE provide?
- Goals of Synapse Tech Stack
- THE HIVE Startups
- Demystifying IoT Market
- Synapse Stack for IoT
- Big Data Challenge
- Synapse Lambda Architecture
- Synapse Components
- Synapse Internals
- AKILI – Synapse Machine Learning
An introduction to data governance, Philippe Bourgeois, Senior Consultant, Trivadis. Talk given at the Swiss Data Forum on 24 November 2015 in Lausanne.
How do you analyze a Petabyte of data?
The Spark Python API, or PySpark, exposes the Spark programming model to Python. Apache® Spark™ is open source and one of the most popular Big Data frameworks for scaling out tasks across a cluster. It was developed to use distributed, in-memory data structures to improve processing speeds for massive amounts of data.
We'll also look into Spark SQL, Apache Spark's module for working with structured data, and MLlib, Apache Spark's scalable machine learning library.
What will you learn?
Perform Big Data analysis with PySpark
Run SQL queries on DataFrames using the Spark SQL module
Use machine learning with the MLlib library
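A minimal sketch of what such an analysis can look like, assuming a local PySpark installation; the input file and column names are hypothetical:

# Minimal PySpark sketch: a DataFrame queried through the Spark SQL module
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# Load semi-structured data; Spark infers the schema on read
events = spark.read.json("events.json")  # hypothetical input file
events.createOrReplaceTempView("events")

# Plain SQL over the DataFrame via the Spark SQL module
top_users = spark.sql(
    "SELECT user_id, COUNT(*) AS n_events "
    "FROM events GROUP BY user_id ORDER BY n_events DESC LIMIT 10"
)
top_users.show()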
Big data
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem."[2] Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."[3] Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[4] connectomics, complex physics simulations, biology and environmental research.[5]
Data sets grow rapidly - in part because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial sensors (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks.[6][7] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[8] as of 2012, 2.5 exabytes (2.5×10¹⁸ bytes) of data are generated every day.[9] One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.[10]
Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work may require "massively parallel software running on tens, hundreds, or even thousands of servers".[11] What counts as "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
Sr. Architect Pradeep Reddy, from Qubole, presents the state of Data Science in enterprise industries today, followed by a deep dive into an end-to-end, real-world machine learning use case. We'll explore the best practices and challenges of big data operations when developing new machine learning features and advanced analytics products at scale in the cloud.
GITEX Big Data Conference 2014 – SAP Presentation (Pedro Pereira)
Big, Fast and Predictive Data: How to Extract Real Business Value – in real time.
90% of the world’s data was created in the last two years. If you can harness it, it will revolutionize the way you do business. Big Data solutions can help extract real business value – in real time.
Slides from the breakfast briefing of 11 December 2013.
In a delicate economic context, "big data" tools provide the speed, flexibility and scalability required to implement enterprise projects that exploit large volumes of information. These technologies are now a reality to be integrated into IT projects.
Klee Group is organizing this themed breakfast with speakers from the Big Data world:
- MongoDB
- Elasticsearch
- CMS Rubedo
Machines learn better with Semantics!
See how taxonomy management and the maintenance of knowledge graphs benefit from machine learning and corpus analysis, and how, in return, machine learning gets improved when using semantic knowledge models for further enrichment.
The core idea behind Hadoop is to distribute both the data and the user software across individual shards within the cluster. The Bigdata Replay method is drastically different in that it packs user software into batches on a single multicore machine and uses circuit emulation to maximize throughput when bringing in data shards for replay. The effect of hotspots, defined as drastically higher access frequency to a small portion of (popular) data, is different in the two platforms. This paper models the difference numerically but in a relative form, which makes it possible to compare the two platforms.
Green Compute and Storage - Why does it Matter and What is in Scope (Narayanan Subramaniam)
Presentation made for BITS students under the auspices of IEEE Goa on the occasion of Lumini '21 - BITS Goa's annual technical symposium. The talk gives an overview of why green compute/storage matters as the Internet explodes with voice, video and other content, consuming 8% (3 TWh) of total global electricity production and rising exponentially to 21% (9 TWh) by 2030. This is likely to be accelerated by the advent of 5G and IoT everywhere. I explore 3 key pillars of computing with respect to "green" and the consequences that need to be mitigated in short order.
The rise of “Big Data” on cloud computing: Review and open research issues
Paper Link: https://www.researchgate.net/publication/264624667_The_rise_of_Big_Data_on_cloud_computing_Review_and_open_research_issues
Unlock Your Data for ML & AI using Data Virtualization (Denodo)
How Denodo Complements a Logical Data Lake in the Cloud
● Denodo does not substitute data warehouses, data lakes, ETLs...
● Denodo enables the use of all of them together, plus other data sources
○ In a logical data warehouse
○ In a logical data lake
○ They are very similar; the only difference is in the main objective
● There are also use cases where Denodo can be used as a data source in an ETL flow
Big Data: Big Numbers, Bigger Questions - a presentation at Big Data Week (Chloe Thomas)
This is a presentation by Simon Spyer from Conduit at the London Big Data Week.
Conduit are Data Value Architects. We transform business through the application of data.
Find out more at www.conduitltd.com and @UkConduit.
This presentation shows how Big Data impacts business and technology and asks (and maybe answers) the question: how new are Big Data and the effects it causes...?
"Performance de l'ingénierie, l'approche Thalès" est une présentation de Françoise Nahabetian Directeur Consulting Excellence opérationnelle chez Thalès Consulting, au cours de la table ronde "expérience client, l'homme au cœur de la transformation" de L'Observatoire de l'Excellence Opérationnelle du 3 Décembre 2015.
This presentation, by big data guru Bernard Marr, outlines in simple terms what Big Data is and how it is used today. It covers the 5 V's of Big Data as well as a number of high value use cases.
Datumize is a software vendor established in 2014 in Barcelona (Spain) working on data integration technology.
We develop innovative products that allow companies to enjoy actionable insights based on Dark Data - data not stored and therefore not used.
Our secret sauce is a proprietary and powerful data collection engine, Datumize Data Collector (DDC), that gets data from fancy sources that most other vendors do not consider.
How do you get started with IoT? Learn how Ayla Networks and mnubo integrate to provide product manufacturers with a turnkey IoT connectivity and analytics solution.
Given the data center industry’s cagey nature – the secrecy around critical infrastructure, the NDAs, and so on – we can’t make specific predictions without substantial risk of looking like total fools. But from conversations with vendors and analysts we can at a minimum get some idea of the directions data center technologies are moving in.
Watch full webinar here: https://bit.ly/2vN59VK
Having started out as the most agile and real-time enterprise data fabric, data virtualization is proving to go beyond its initial promise and is becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
- What data virtualization really is.
- How it differs from other enterprise data integration technologies.
- Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations.
BigDataPilotDemoDays - I-BiDaaS Application to the Manufacturing Sector Webinar (Big Data Value Association)
The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. To this end, BDV PPP projects I-BiDaaS, BigDataStack, Track & Know and Policy Cloud deliver innovative technologies to address the emerging needs of data operations and applications. To fully exploit the sustainability and take full advantage of the developed technologies, the projects onboarded pilots that exhibit their applicability in a wide variety of sectors. In the Big Data Pilot Demo Days, the projects will showcase the developed and implemented technologies to interested end-users from the industry as well as technology providers, for further adoption.
Is it sensible to use Data Vault at all? Conclusions from a project (Capgemini)
The presentation focuses on the question “Is it sensible to use Data Vault at all?” The author outlines the impact of Data Vault on the architecture, the implementation and on the project.
Webinar Industrial Data Space Association: Introduction and Architecture (Thorsten Huelsmann)
The Industrial Data Space Association is an industry- and user-driven initiative to develop a global Industrial Data Space standard and reference architecture that provides data sovereignty. The work is based on use cases and supports certifiable software solutions and business models for the data economy. This webinar by Lars Nagel and Sebastian Steinbuss gives an overview of the Industrial Data Space initiative and explains the Reference Architecture and its main components.
Guest Speaker in the 2nd National level webinar titled "Big Data Driven Solutions to Combat Covid 19" on 4th July 2020, Ethiraj College for Women(Auto), Chennai.
The emergence of social, mobile, cloud, big data and analytics are fundamentally changing how we live, work and interact.
Mobile devices are ubiquitous: changing consumer behaviors, supplanting PCs, generating massive amounts of data and putting new demands on the enterprise to not only support these devices but to adjust the way it does business.
Social technologies are changing the way we interact, communicate and share information – equally generating vast amounts of data and impacting business as they try to unlock the full potential social has to offer.
Cloud technologies bring new scale and efficiency to service delivery, enable more agile ways of doing business and drive business-model innovation. For companies, the cloud also brings information and applications to people at the right time and place.
All of these trends are fueling an explosion of data. Not only do enterprises need to store, manage and secure this data, they also need to derive meaningful insight from these vast amounts of data. Data is the basis of significant opportunity and a source of competitive advantage for all organizations. Data is a new economic asset, the next natural resource.
These trends are spawning new workloads, business processes and technology deployments that are putting unprecedented demands on our IT environments.
Digital transformations require a new hybrid cloud, one that is open by design and frees clients to choose and change environments, data and services as needed. This approach allows cloud apps and services to be rapidly composed using the best relevant data and insights available, while maintaining clear visibility, control and security everywhere. How do you decide where to put data on a hybrid cloud and how to use it? What's the best hybrid cloud strategy in terms of data and workload? How should you use a 50/50 rule or an 80/20 rule, together with user interaction, to evaluate which data and workloads to move to the cloud and which to keep on-premise? Hybrid cloud provides an open platform for innovation, including cognitive computing. Organizations are looking to take shadow IT out of the shadows by providing self-service access to information, and a hybrid cloud strategy allows that. Finally, how can hybrid cloud be used to better manage data sovereignty and compliance?
CA is helping the application economy. Data is the fuel of the application economy – what customers, partners, employees demand. Real business needs for big data: This is about GROWTH for companies. Top line. Better customer experiences, new customers, new revenue. Ultimately mission critical.
Consequently, companies are spinning up new projects, with lots in the pipeline: 84% of you have projects to be deployed within the next year.
Everything counts, structured/unstructured: 94% of you plan to use all data available – systems of record (e.g. MF), unstructured, everything. And everything has changed – tools, technology, processes & people.
Conquer complexity by getting the Big Data big picture here: http://cainc.to/BigData
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Observability Concepts EVERY Developer Should Know - DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The new frontiers of AI in RPA with UiPath Autopilot™ (UiPathCommunity)
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that brings Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of using Autopilot in various tools of the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to part 3 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps is. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
SAP Sapphire 2024 - ASUG301 Building Better Apps with SAP Fiori (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Big data presentation, explanations and use cases in industrial sector
1. Big Data: explanations & use cases in the industrial sector
September 2015
Nicolas SARRAMAGNA
https://fr.linkedin.com/pub/nicolas-sarramagna/19/941/587
2. CONTENTS
What's Big Data?
1. Definition, the 3 Vs
2. General use cases
3. Technologies used
4. Market overview
Big Data in the industrial sector
1. What for?
2. Vision
3. Demo PoC / PoV
3. What's Big Data – the 3 Vs
BIG DATA:
New contexts for data -> the 3 Vs
New business ambitions, new technologies
VOLUME: MASSIFICATION AND AUTOMATION OF DATA EXCHANGES
80% of existing data was created in the last 12 months
30 billion pieces of content on Facebook each month, 5 billion pages on Flickr, 2 billion videos viewed on YouTube each day
VARIETY: MULTIPLICATION OF SOURCES AND TYPES
Mails, documents, logs (applications, networks, systems), databases, sensor data, open data, social networks, blogs, forums, articles, browsing history, geolocation data, …
Structured data (databases), semi-structured (HTML pages, tweets, XML), unstructured (mail content, Excel, PowerPoint, video, audio)
VELOCITY: THE NEED TO COLLECT AND PROCESS DATA IN REAL TIME
Risk management (fraud, security of the information system – SIEM)
Real-time route optimization
Personalized advertising
4. What's Big Data – new technologies
BIG DATA:
More efficient components, but also I/O throughput limits -> grid architectures
New technological know-how: storage of large data volumes in a cluster at lower cost, distributed computing, industrialized data mining, on-demand IT architecture with the cloud
ORIGIN OF BIG DATA
Indexing the web and powering the search engines of Google and Yahoo – around 2006
5. What's Big Data - general IT use cases
COMPLETE THE DATA ARCHITECTURE
Vision of a Data Lake / Enterprise Data Hub
Bring applications closer to the data rather than duplicating data for each application
"Deliver" managed data
REDUCE STORAGE AND COMPUTING COSTS
Big Data technologies use commodity hardware and/or the cloud, plus parallel computing
STRONG TECHNICAL CONSTRAINTS
Handle 1,000+ transactions / second
Collect streams of 1,000+ events / second
Compute with 10+ threads per CPU core
Store 10+ TB data sets for processing
Without big data technologies, these require major software and hardware adaptations
6. What's Big Data - general business use cases
END-USER CENTRIC
Product recommendation
Optimization of ads
PROCESS CENTRIC
Detection of unexpected events: fraud, network incidents, predictive maintenance
Path optimization
DIVERSIFICATION OF THE BUSINESS MODEL
Orange: resale of geolocation data
7. What's Big Data – misconceptions
Misconception: only used for unstructured data -> Reality: used with both structured and unstructured data
Misconception: only needed for massive data sets -> Reality: stores and analyses data of any size
Misconception: only available as open source -> Reality: all the major software vendors are on board
Misconception: replaces my current BI platform -> Reality: it is complementary to our existing BI strategy and investments; Big Data will become essential for Business Intelligence
9. What's Big Data – BI opportunities
[Diagram: THE PAST - BI vs. BIG DATA ANALYTICS]
10. What's Big Data - technologies under the hood - standard Hadoop
[Diagram: the Hadoop platform]
11. What's Big Data - technologies under the hood
COLLECT
Spark, Flume, Sqoop
Inject data into HDFS and NoSQL databases: command line, REST API, Java API, streaming ingestion, bulk ingestion, ingestion from an RDBMS (see the sketch below)
STORAGE
Cloud, Hadoop -> the HDFS distributed file system (large and small data sets)
NoSQL ("not only SQL"): distributed, schema-less databases: CAP theorem; key-value, column-, document- and graph-oriented stores
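The REST route into HDFS mentioned under COLLECT is the WebHDFS API; a rough sketch with Python's requests library (host, port and paths are hypothetical; 50070 was the usual NameNode HTTP port on Hadoop 2.x):

import requests

NAMENODE = "http://namenode.example.com:50070"  # hypothetical NameNode address
HDFS_PATH = "/webhdfs/v1/data/raw/events.json"

# Step 1: ask the NameNode where to write; it replies with a redirect to a DataNode
r = requests.put(NAMENODE + HDFS_PATH + "?op=CREATE&overwrite=true",
                 allow_redirects=False)
datanode_url = r.headers["Location"]

# Step 2: send the actual bytes to that DataNode
with open("events.json", "rb") as f:
    requests.put(datanode_url, data=f)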
12. What's Big Data - technologies under the hood
ANALYSIS
Data Science, Map/Reduce, Spark
Analyze and clean the data
Goal: build a model
Machine Learning: one data set to train the model (67% of the data), one data set to evaluate it (33%) - see the sketch below
VISUALIZATION
DataViz: all the visual representation techniques used for data mining
Makes decision indicators easier to build
Provides indicators whatever the size or type of the data
Innovate: offer new perspectives to discover new opportunities
Tableau, QlikView, Power Pivot
Pull data via an ODBC connector, a JDBC connector, a REST API, or the DataViz tool's native connector
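A sketch of the 67% / 33% train/evaluate split in Spark MLlib; the toy DataFrame and its columns are hypothetical:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("train-eval-split").getOrCreate()

# Hypothetical toy data: a "features" vector column and a numeric "label"
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0), (Vectors.dense([2.0, 1.0]), 1.0)] * 50,
    ["features", "label"])

# 67% of the data to train the model, 33% to evaluate it
train, test = df.randomSplit([0.67, 0.33], seed=42)

model = LogisticRegression().fit(train)
auc = BinaryClassificationEvaluator().evaluate(model.transform(test))
print("AUC on the 33% hold-out set:", auc)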
13. What's Big Data - technologies under the hood
CONCEPTS OF A BIG DATA ARCHITECTURE
Distributed data and processing: the file system, the jobs (Map/Reduce, Spark, …), the databases (NoSQL) - see the sketch below
Data/processing co-location: replication and processing strategy in Hadoop
Horizontal elasticity: master/worker architecture
Shared nothing: when a node breaks down, no data is lost; each node is independent
Design for failure: when a node breaks down, the cluster continues to work
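To make the distributed map/reduce idea concrete, the classic word count in PySpark; the HDFS paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# Each HDFS block is read by a task on the node that stores it (co-location)
counts = (sc.textFile("hdfs:///data/raw/logs/*.log")   # hypothetical input path
            .flatMap(lambda line: line.split())        # map: line -> words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))          # reduce: sum per word
counts.saveAsTextFile("hdfs:///data/out/wordcount")    # hypothetical output path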
14. What's Big Data - technologies under the hood
HDFS: HADOOP DISTRIBUTED FILE SYSTEM
Name node: the master of the system; maintains and manages the blocks present on the data nodes
Data nodes: slaves deployed on each machine, providing the actual storage; serve read and write requests from clients
15. What's Big Data – technologies under the hood - storage costs
USE COMMODITY HARDWARE
In Big Data, the data center is not a collection of servers but a collection of co-located CPUs, RAM and local disks
[Chart: what 1 million $ gets you]
16. What's Big Data - market overview
COTS DISTRIBUTIONS (the leaders)
Cloudera, #1
Hortonworks, #2
MapR, #3
CLOUD (BASED ON A DISTRIBUTION)
Microsoft – Azure
Amazon – AWS
APPLIANCE VENDORS, HIGHER COSTS
Teradata
Oracle
17. Big Data – positioning of the distributions
CLOUDERA
Vendor business model: €5-6k / year / node
Amazon deploys Cloudera
More mature than the other distributions
HORTONWORKS
Free; business model based on support: €15k / year per slot of 4 nodes or per slot of 50 TB
Azure and Amazon deploy Hortonworks
Less mature than Cloudera on security and administration
MAPR
Vendor business model
Diverges from standard Hadoop
[Chart: price comparison of Cloudera, Hortonworks and MapR - between distributions, a ratio of 1 to 4]
18. CONTENTS
What's Big Data?
1. Definition, the 3 Vs
2. Use cases
3. Technologies under the hood
4. Market overview
Big Data in the industrial sector
1. What for?
2. Vision
3. Demo PoC / PoV
19. Big Data in the industrial sector – What for? - IT use cases
BUILD A DATA LAKE
Reduce costs, move cold data out of the Data Warehouse
Break down the silos of data storage
Store raw data and be able to work (data mining) on all of the data
Open up the data, enrich it with metadata
LOG ANALYSIS AND MONITORING - SIEM
Monitoring of application, network and system logs -> Splunk
PREDICTIVE MAINTENANCE
Monitoring of sensor data, predicting breakdowns across plants
20. Big Data in the industrial sector – What for? - HR use cases
SKILLS VISION AND MANAGEMENT
Cross-reference information from professional networks (Viadeo, LinkedIn) with internal HR information: build a map of the skills in PO
Build and manage groups of skills, enrich internal HR tools
E-REPUTATION
Follow data about your brand, your competitors and your customers in real time
Monitoring of social networks (Twitter, Facebook), press news, financial news, forums, blogs, …
React quickly to the results if necessary
21. Big Data in the industrial sector – What for? - Marketing use cases
360° VISION OF CUSTOMERS, SUPPLIERS, COMPETITORS
Gather as much information as possible about a company: social, legal, financial, competitive position
Evaluate the risk and the opportunity of working together
VISION OF THE ROI OF PLANTS
Real-time indicators from the plants: investment, number of bumpers and tanks produced
Rank the plants, predict gains
22. Big Data in the industrial sector – Vision & Roadmap
2016: BEGIN TO BUILD A DATA LAKE
Make the data directly available for BI and Data Science, and/or transfer it to a Data Warehouse
Collect data and manage it (who has access, metadata)
Infrastructure: hybrid with cloud / on-premise / appliance?
2016: CREATE A NEW CROSS-DIVISION SERVICE AROUND THE DATA
DataViz: create reporting, use your current DataViz tools -> current BI analysts, no change
Data IS: knows its data and can provide metadata to classify it -> current IS staff, no change
Data engineer: uses collection tools, codes jobs, transforms data -> new skills
Data administrator IT: Big Data architecture integration and monitoring -> new skills
Data analysis & data mining: cross-analyses the data, applies models, designs indicators for the DataViz -> new skills
2016+: IMPLEMENT OTHER USE CASES
Start small and accelerate
23. Big Data in the industrial sector – Data Lake
DATA LAKE / ENTERPRISE DATA HUB / DATA RESERVOIR
Low-cost storage of heterogeneous data (structured, semi-structured and unstructured)
Raw data storage, but data enriched and classified by metadata – a data reservoir, not a SWAMP
Used for data exploration, analysis and data mining
Schema on read: the old ETL becomes ELT (see the sketch below)
Can be used directly for BI (ELT mode)
DATA LAKE AND DATA WAREHOUSE
Completes the sources of the Data Warehouse
Can store cold data from the Data Warehouse
Feeds the Data Warehouse
DATA LAKE VISION
Stores aggregated data, can store all the data
Data-Lake-centric vision: bring the applications to the data, do not copy the data to the applications
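"Schema on read" in practice: the lake stores the raw files untouched and the structure is applied only when querying; a PySpark sketch with hypothetical paths and columns:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw sensor files were loaded into the lake as-is (E and L, no upfront T)
schema = StructType([StructField("sensor_id", StringType()),
                     StructField("value", DoubleType())])
readings = spark.read.schema(schema).json("hdfs:///lake/raw/sensors/")

# The transformation happens at read time, inside the lake (the T of ELT)
readings.groupBy("sensor_id").avg("value").show()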
24. Big Data in the industrial sector – Data Lake - infrastructure
BIG DATA INFRASTRUCTURE
Hybrid with cloud: NO if you want to keep your data in-house (security); requires network effort and cloud skills
Appliance: infrastructure, licenses, deployment -> very high TCO
On-premise: the best compromise between cost, convenience of deployment and usage
CHOICE: ON-PREMISE INFRASTRUCTURE
Go for Cloudera (better administration and security functionality, "real-time" module: Impala) or Hortonworks
Send your IT staff to training: development, administration, data mining
25. Big Data in the industrial sector – Proof of Concept – Proof of Value
SUBJECT: E-REPUTATION
GOALS
Put in place e-Reputation indicators for your enterprise / competitors / suppliers / customers from various sources: news, social networks
Experiment with big data tools
INDICATORS
Who is talking about it? How (positive, negative, neutral)? What is the content? Where in the world? From which source?
Different views of e-Reputation: financial, HR, societal, commercial
DEMO
"Big Data" : terme designant une rupture avec le traitement traditionnel de la donnee
Le Big Data permet de solutionner de nouvelles problematiques ou des anciennes d’une meilleure maniere
Goulet d’étranglement sur les accès écriture/lecture disque, le débit disque ne suit la croissance des espaces de stockage
Big Data ne remplace pas l’architecture existante du BI mais la complete et la réoriente : applications vers data et non data (et ses duplications) vers applications
Descriptive , Diagnostic : regarder le passé et trouver les raisons d’un succes ou d’un echec -> BI
Predictive : dégager un modèle qui donne les futurs tendances -> BIG DATA
Prescriptive : sous différentes contraintes, déterminer le meilleur moyen d’y parvenir -> BIG DATA
Raconter le cycle de vie de la donnée selon un ordre chrono depuis la source de données jusqu’à la restit.
Ods : data opérationnelles. Edw : entrepots de données data agrégée.
Datamart : /s ens d’un entrepot. Hdfs système de fichiers distribués.
Event -> Kafka (syst. Message distribue) -> Storm (traitement en tps reel du msg, opt.) -> Nosql
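A rough sketch of that last pipeline using the kafka-python client; the broker address, topic name and NoSQL sink are hypothetical placeholders, and the processing step stands in for Storm:

import json
from kafka import KafkaConsumer  # kafka-python package

def store_in_nosql(event):
    # Hypothetical sink, standing in for e.g. a Cassandra or HBase write
    print("would store:", event)

consumer = KafkaConsumer("events",                        # hypothetical topic
                         bootstrap_servers="broker:9092", # hypothetical broker
                         value_deserializer=lambda b: json.loads(b))

for msg in consumer:
    event = msg.value
    # Optional real-time processing (the Storm box), then the NoSQL write
    store_in_nosql(event)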