Big Data Ecosystem 
Ivo Vachkov 
Xi Group Ltd.
Big Data ??? 
• Definition
• The 3Vs:
  • Volume
  • Velocity
  • Variety
• Added later:
  • Veracity
  • Variability
  • Complexity
Processing Paradigms 
• Batch Processing
  • Large volumes
  • Lower volatility
  • Incremental updates
• Real-time Processing
  • Smaller volumes
  • Higher volatility
  • Possible full regeneration
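
A minimal sketch of the two paradigms, assuming a hypothetical stream of (user, amount) events. The batch function scans the whole data set in one pass; the real-time class keeps small running state and updates it per event. The event shape and all names here are illustrative and not taken from any specific framework.

```python
from collections import defaultdict

# Hypothetical events: (user, amount). The shape is an assumption for illustration.
EVENTS = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]


def batch_totals(events):
    """Batch: scan the whole data set at once and produce a complete result."""
    totals = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


class StreamingTotals:
    """Real-time: keep running state and update it as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, user, amount):
        self.totals[user] += amount
        return self.totals[user]  # an up-to-date value is available immediately


stream = StreamingTotals()
for user, amount in EVENTS:
    stream.on_event(user, amount)

# Once every event has been seen, both paths hold the same totals.
print(batch_totals(EVENTS) == dict(stream.totals))
```

The trade-off is latency against simplicity: the batch path answers only after a full pass, while the streaming path answers after every event but has to manage state.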
The Data Path 
• From Collection …
• … to Processing …
• … to Query:
  • Consumption
  • Visualization
  • [Predictive] Analysis
  • Monitoring / Validation
• ETL, anyone?!
The Data Path
Data Path / Collection 
• Multiple sources (RDBMS, logs, activity streams, message queues, time series, etc.)
• Multiple types (structured, unstructured, free text, bags of words, raw, normalized, etc.)
• Collection starts with raw data and produces digital artifacts suitable for machine processing.
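
As a rough illustration of "raw data in, machine-processable artifacts out", the sketch below parses free-text log lines into structured records. The log layout (`<ISO timestamp> <LEVEL> <message>`) and the `collect` helper are assumptions made for this example only.

```python
import re
from datetime import datetime

# Assumed log layout for illustration: "<ISO timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def collect(raw_lines):
    """Turn raw text lines into structured records; skip lines that do not parse."""
    records = []
    for line in raw_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # a real collector would count or quarantine these lines
        records.append(
            {
                "ts": datetime.fromisoformat(match["ts"]),
                "level": match["level"],
                "msg": match["msg"],
            }
        )
    return records


raw = [
    "2014-03-01T12:00:00 INFO user login id=42",
    "garbage line without structure",
    "2014-03-01T12:00:05 ERROR payment failed id=42",
]
records = collect(raw)
print(len(records), records[1]["level"])
```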
Data Path / Collection 
• Wide variety of components and technologies:
  • Flat files and binary formats (CSV, Avro, etc.) on a typical file system
  • Cluster-specific file systems
  • RDBMS/SQL, NoSQL, NewSQL, MPP DBs, graph databases, document databases
  • Column stores
  • Key-value stores
  • Time series stores
  • Streaming and transformation engines
Data Path / Processing 
• Different processing paradigms:
  • Batch Processing
  • Real-time Processing
• Multiple expected outcomes:
  • Data
  • Action
• Different destinations:
  • Data stores
  • Data-driven Control Planes
Data Path / Processing 
• Smaller number of technologies:
  • Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)
  • Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)
  • HPC / Supercomputing
• Data parallelism is the key!
• Data locality is important!
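
A small single-process sketch of the map/reduce idea and why data parallelism matters: each partition is counted independently (map), then the partial results are merged (reduce). The partitions and function names are hypothetical; a real cluster would run the map step on the node that already holds each partition.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: in a cluster each would live on a different node.
PARTITIONS = [
    ["big data", "data path"],
    ["data locality", "big cluster"],
]


def map_partition(lines):
    """Map step: count words inside one partition, independent of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


# The map calls share no state, so they could run in parallel
# (for example with multiprocessing.Pool.map); plain map() keeps the sketch simple.
partials = map(map_partition, PARTITIONS)
word_counts = reduce(reduce_counts, partials, Counter())
print(word_counts.most_common(2))
```

Because the map step never looks outside its own partition, adding nodes adds throughput, and moving the computation to the data avoids shipping the raw input over the network.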
Data Path / Processing 
• The importance of M/R
• Self-hosted solutions:
  • Apache Hadoop
  • Cloudera, Hortonworks, etc.
• Cloud-based solutions:
  • AWS EMR (+Data Pipeline, +Kinesis, +S3, +DynamoDB)
  • Joyent Manta
  • … many others …
Data Path / Query 
• Processing creates digital artifacts
• Extremely high variety of technologies, components, and services to deal with those artifacts:
  • SQL interfaces on top of NoSQL stores
  • NoSQL to NoSQL
  • NoSQL to RDBMS
  • Output to 3rd party API services
  • Output to proprietary interfaces
  • … a lot more …
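
One possible shape of the "NoSQL to RDBMS" step, sketched with Python's built-in `sqlite3`: schemaless documents are flattened into a fixed table so they can be queried with SQL. The documents, table name, and columns are invented for illustration.

```python
import sqlite3

# Hypothetical processing output: schemaless documents, as a NoSQL store might hold them.
DOCS = [
    {"username": "alice", "country": "BG", "visits": 12},
    {"username": "bob", "country": "DE"},  # "visits" is missing in this document
    {"username": "carol", "country": "BG", "visits": 4},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_stats (username TEXT PRIMARY KEY, country TEXT, visits INTEGER)"
)

# Flatten each document into a fixed row; absent fields become NULL.
conn.executemany(
    "INSERT INTO user_stats (username, country, visits) VALUES (?, ?, ?)",
    [(d["username"], d.get("country"), d.get("visits")) for d in DOCS],
)

rows = conn.execute(
    "SELECT country, COUNT(*), COALESCE(SUM(visits), 0) "
    "FROM user_stats GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```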
Data Path / Query 
• “Query-friendly” stores:
  • Classical RDBMS, NewSQL
  • Big Table & Column Stores
  • Key-Value Stores
  • Search-oriented services
• Visualization:
  • 3rd party services
  • Tableau
  • HTML5 / JavaScript Dashboards
  • Programming languages / Visualization libraries
Data Path / Query 
• Analysis
• Reports
• Trends / Predictions
• Real-time analytics
• Data-driven Control Plane
• Classical Business Intelligence
• Machine Learning (Mahout)
• Data Science (usually a fancy term for Statistics)
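
To make "Trends / Predictions" concrete, here is a plain least-squares line fit followed by a one-step forecast. The daily sign-up numbers are made up, and a straight line is only one of many possible models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: sign-ups observed on days 1..5.
days = [1, 2, 3, 4, 5]
signups = [10, 12, 14, 16, 18]

slope, intercept = fit_line(days, signups)
forecast_day_6 = slope * 6 + intercept  # extrapolate the trend one day ahead
print(slope, intercept, forecast_day_6)
```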
Big Data & Monitoring 
• Infrastructure Monitoring
  • Well understood
  • Many products
• Full-Stack Application Monitoring
  • Technical challenges
  • No “one size fits all” solutions
• Data Quality Monitoring
  • Emerging technologies
  • Home-grown solutions
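
A home-grown data quality check can start very small. The sketch below, with hypothetical field names and thresholds, reports the share of records that miss a required field or fall outside an expected range.

```python
def quality_report(records, required, ranges):
    """Return the share of records failing each check."""
    total = len(records)
    missing = {field: 0 for field in required}
    out_of_range = {field: 0 for field in ranges}
    for rec in records:
        for field in required:
            if rec.get(field) is None:
                missing[field] += 1
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                out_of_range[field] += 1
    if total == 0:
        return {"total": 0, "missing_rate": {}, "out_of_range_rate": {}}
    return {
        "total": total,
        "missing_rate": {f: c / total for f, c in missing.items()},
        "out_of_range_rate": {f: c / total for f, c in out_of_range.items()},
    }


# Hypothetical sensor readings; the field names and the range are assumptions.
RECORDS = [
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "b", "temp_c": None},
    {"sensor": None, "temp_c": 400.0},
    {"sensor": "d", "temp_c": 19.0},
]
report = quality_report(
    RECORDS, required=["sensor", "temp_c"], ranges={"temp_c": (-50.0, 60.0)}
)
print(report)
```

Feeding these rates into the same alerting used for infrastructure metrics is one way to treat data quality as a monitored signal.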
Big Data & Monitoring
• Infrastructure Monitoring
Big Data & Monitoring
• Application Monitoring
Big Data & Monitoring
• Data Quality Monitoring
… a bag of acronyms … 
• Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS, HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix, Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis, Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue, OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer, Tableau, Pentaho, Sumo Logic, MongoDB, CouchDB, Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB, Memcached, FoundationDB, …
• AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift, ElastiCache, SQS, SWF
• Joyent: Manta
Piece of advice … 
• Collect relevant data!
  Collecting data for data’s sake only costs money …
• Use the processing technology that best matches your business case!
  Hadoop is pointless if your clients only want fast geospatial searches …
• Consume wisely!
  Knowing that 100% of X is Y means nothing when there is only one X …
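
The last point can be made mechanical: never report a percentage without its sample size. A minimal sketch, in which the cutoff of 30 samples is an arbitrary illustrative choice and not a statistical rule:

```python
def share_with_support(flags, min_samples=30):
    """Return (share, n, enough_data); min_samples is an illustrative cutoff."""
    n = len(flags)
    if n == 0:
        return None, 0, False
    share = sum(1 for flag in flags if flag) / n
    return share, n, n >= min_samples


# "100% of X is Y", but there is only one X.
single = share_with_support([True])

# A lower share backed by far more observations.
many = share_with_support([True] * 45 + [False] * 5)

print(single, many)
```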
Conclusion
Q & A

Editor's Notes

  1. Intro, Abstract, Who am I
  2. Big data usually includes data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. Although Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept draws a sharper distinction between big data and Business Intelligence with regard to data and their use:[18] Business Intelligence applies descriptive statistics to data with high information density to measure things, detect trends, etc.; big data applies inductive statistics and concepts from nonlinear system identification[19] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density[20], in order to reveal relationships and dependencies and to predict outcomes and behaviors.[19][21] Big data can also be defined as a large volume of unstructured data that cannot be handled by standard database management systems such as DBMS, RDBMS, or ORDBMS.
  3. Two distinct processing paradigms that drive different technologies. Why one? Why the other? Use cases …
  4. Comes from ETL after all, specific but known.