BIG DATA INGESTION & MANIPULATION
KW Big Data Peer2Peer
http://www.meetup.com/KW-Big-Data-Peer2Peer/
Presented by George Long
Who Am I – George Long
■ Software Architect living in the KW area
■ 3 decades of software engineering in UK and North America
– Speciality is distributed systems design with Big Data, Cloud and NoSQL
technologies
■ 5th degree black belt in taekwondo
■ Email : master.geo.san@gmail.com
■ Linkedin : https://ca.linkedin.com/in/mastergeosan
Overview
■ What does it actually take to get the data you need to add that one metric to your
report/dashboard?
■ What's it like to navigate the early conversations of an analytics solution?
■ How is one technology selected over another and how do those selections impact or
define other selections?
Agenda
■ Setting the solution space
■ Sample big data use cases
■ Towards a Big Data Culture
■ Hadoop Tools
– ETL tools for ingest
– Tools for data manipulation
– Publishing Results
Setting the solution space
Ingest Use Case Prerequisites
■ What is your use case – what are you trying to do?
– Are there multiple asks?
– Who are the end-users?
■ Is this a new ask or refinements to existing workflows?
– How responsive is your organisation to change?
– Is there an existing team to manage the solution?
■ How will the data source yield their information?
– What are the network protocols? Frequency, Volume etc.
■ Is the data structured or unstructured?
– Are the formats stable?
■ Do you understand the data life cycle of your data?
– Retention policies, privacy, access control
■ Performance & Availability
– How quickly are results required?
– What are the tolerances for system failure? SLAs?
Source: http://www.rosebt.com/blog/the-data-mining-process
5 Vs of Big Data
Source: http://iihtofficialblog.blogspot.ca/2014/07/5-vs-of-hadoop-big-data.html
Sample Big Data Use Cases
UC1-LOG – Server log analysis
■ Analysis of data-at-rest
■ A massive volume of logs is captured by syslog and requires aggregation by hour/day
■ Results are produced every hour
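As a rough illustration of UC1-LOG, the hourly roll-up can be sketched in plain Python. The log line layout (ISO timestamp, host, message) and the sample lines are assumptions made for this sketch, not the format of any particular syslog deployment.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines: "<ISO timestamp> <host> <message>" (assumed layout).
SAMPLE_LOGS = [
    "2016-03-01T10:05:11 web01 GET /index.html 200",
    "2016-03-01T10:47:02 web02 GET /login 500",
    "2016-03-01T11:01:45 web01 GET /index.html 200",
]

def hour_bucket(line):
    """Return the hour bucket (YYYY-MM-DD HH:00) a log line falls into."""
    timestamp = line.split(" ", 1)[0]
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:00")

def aggregate_by_hour(lines):
    """Count log lines per hour bucket."""
    return Counter(hour_bucket(line) for line in lines)

hourly_counts = aggregate_by_hour(SAMPLE_LOGS)
```

On a cluster the same grouping would run over far larger inputs; the sketch only shows the shape of the aggregation.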
UC2-MSTR – Monitoring Streaming Events
■ Analysis of data-in-motion
■ Extends UC1-LOG by requiring that certain log events be used to notify of service impact
– I.e. generate actionable events from real-time (RT) analysis of the service logs
■ Results are produced continuously
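One way to picture UC2-MSTR is a generator that watches a stream of events and emits an alert as soon as a rule fires. The `(host, level)` tuple shape and the "three consecutive errors" rule are assumptions chosen for this sketch only.

```python
def actionable_events(log_stream, threshold=3):
    """Yield an alert each time a host logs `threshold` consecutive ERROR lines.

    `log_stream` is any iterable of (host, level) tuples; both the tuple shape
    and the threshold rule are assumptions made for this sketch.
    """
    consecutive_errors = {}
    for host, level in log_stream:
        if level == "ERROR":
            consecutive_errors[host] = consecutive_errors.get(host, 0) + 1
            if consecutive_errors[host] == threshold:
                yield f"ALERT: {host} logged {threshold} consecutive errors"
        else:
            # Any non-error line resets the run for that host.
            consecutive_errors[host] = 0

events = [("web01", "ERROR"), ("web01", "ERROR"), ("web02", "INFO"), ("web01", "ERROR")]
alerts = list(actionable_events(events))
```

Because the function is a generator, results appear continuously as events arrive rather than at the end of a batch.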
UC3-PPD – Publishing Production Data
■ Refining data for hosting by services
■ Datasets are massaged and published for consumption by customer facing services
■ Data is merged, refined and published for service hosts to consume
■ Process repeats on demand to accommodate new datasets
Towards a Big Data Culture
Human Side of Analytics
Architecting for Analytics
Hadoop Tools
Hadoop ECOSYSTEM is evolving
Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/
ETL Tools for Ingest
File transfer to HDFS
■ Simple file loads, via the following techniques
– Explicit loading via HDFS commands, e.g.
■ hadoop fs -put <file> <hdfs-path>
– Mounting HDFS as a FUSE-enabled filesystem
■ Note that the mounted filesystem supports append-only writes
■ Note - Manually loaded filesets require manual tracking and clean-up
DB Exchange with Hadoop - Apache Sqoop
■ Apache Sqoop is a tool for transferring data between Hadoop and relational
databases. Use Sqoop to import data from a MySQL or Oracle database into HDFS,
run MapReduce on the data, and then export the data back into an RDBMS. Sqoop
automates these processes, using MapReduce to import and export the data in
parallel with fault-tolerance
■ It offers two-way replication with both snapshots and incremental updates.
■ Note - Sqoop requires detailed schema knowledge and synchronisation
configuration of database accounts. DB needs to be up at time of sqooping.
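The snapshot-plus-incremental idea can be sketched without Sqoop itself: remember the highest key already imported and fetch only rows beyond it on the next run. This is roughly the bookkeeping behind Sqoop's incremental append mode; the in-memory SQLite database and the `customers` table are stand-ins invented for this sketch.

```python
import sqlite3

def incremental_import(conn, last_value):
    """Return rows with id > last_value and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    new_last_value = rows[-1][0] if rows else last_value
    return rows, new_last_value

# Hypothetical source database; a real import would target MySQL, Oracle, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "ann"), (2, "bob")])

snapshot, mark = incremental_import(conn, last_value=0)   # first run: full snapshot
conn.execute("INSERT INTO customers VALUES (3, 'cat')")
delta, mark = incremental_import(conn, last_value=mark)   # later run: only new rows
```

The sketch also shows why the source database has to be reachable at import time: every run is a live query against it.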
Log Collection – Apache Flume
■ Flume is a distributed system for collecting log data from many sources, aggregating
it, and writing it to HDFS. It is designed to be reliable and highly available, while
providing a simple, flexible, and intuitive programming model based on streaming
data flows. Flume provides extensibility for online analytic applications that process
data streams in situ.
■ In the original master-based design (Flume 0.9.x), Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper
■ See : http://www.lopakalogic.com/articles/hadoop-articles/log-files-flume-hive/
Queue Ingest - Apache Kafka
■ Apache Kafka is a fast, distributed publish-subscribe messaging system. It is
designed to provide high throughput persistent messaging that’s scalable and
allows for parallel data loads into Hadoop. Its features include the use of
compression to optimize IO performance and mirroring to improve availability,
scalability and to optimize performance in multiple-cluster scenarios.
– Queues decouple systems: Both statically and in time
■ See http://www.slideshare.net/gwenshap/kafka-for-dbas
■ Note - Kafka can buffer source data so the availability of the Hadoop platform can
be relaxed
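The decoupling-in-time point can be illustrated with a toy append-only log: producers keep appending while the consumer is away, and the consumer later resumes from the offset it remembered. `TopicLog` is a hypothetical in-memory class written for this sketch; it is not a Kafka client and ignores partitions, replication and retention.

```python
class TopicLog:
    """In-memory stand-in for a single topic partition (not a Kafka client)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Producer side: append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Consumer side: return all messages at or after `offset`."""
        return self._messages[offset:]

log = TopicLog()
for line in ["evt-1", "evt-2", "evt-3"]:
    log.append(line)          # producers keep writing while the consumer is down

consumer_offset = 0           # the consumer remembers where it stopped
batch = log.read_from(consumer_offset)
consumer_offset += len(batch)
log.append("evt-4")
later = log.read_from(consumer_offset)
```

Nothing is lost while the consumer is offline, which is the sense in which the Hadoop side's availability requirement can be relaxed.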
RT Streaming – Storm or Spark
■ Both Storm and Spark Streaming are open-source frameworks for distributed
stream processing
■ Processing Model, Latency
– Storm processes incoming events one at a time in RT
– Spark Streaming batches up events that arrive within a short time window
before processing them with several seconds of latency
■ Fault Tolerance, Data Guarantees
– Storm tracks individual records and guarantees that each record will be
processed at least once, but allows duplicates
– Spark Streaming provides better support for stateful computation that is
fault tolerant.
■ Batch Layer, which has all the processed batch data from the past
■ Speed Layer or RT feed of similar or same information
■ Serving layer combines the two for transparent access
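The batch, speed and serving layers listed last (a pattern often called the Lambda Architecture) can be sketched as a merge of two views. The per-host counters below are made-up sample data, and summing is just one possible merge rule assumed for this sketch.

```python
def serving_view(batch_view, speed_view):
    """Merge precomputed batch counts with counts from the speed layer."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"web01": 120, "web02": 75}   # counts up to the last batch run
speed_view = {"web02": 3, "web03": 1}      # counts seen since that run
combined = serving_view(batch_view, speed_view)
```

A caller of `serving_view` sees one answer and does not need to know which layer each number came from.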
Tools for data manipulation
Java MR (Map/Reduce)
■ Map/Reduce functionality is accessible via Java.
■ Full applications can be developed, although higher-level constructs such as Pig
and Hive should also be considered, as traditional Java development cycles are
usually longer than for the scripting routes.
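For readers new to the paradigm, the map, shuffle and reduce steps can be sketched in a few lines of Python. This is a single-process illustration of the programming model only; it is not the Hadoop Java API, and the sample records are invented.

```python
from collections import defaultdict

def map_phase(record):
    """Mapper: emit (word, 1) for every word in the record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one key."""
    return key, sum(values)

records = ["big data ingest", "big data manipulation"]
mapped = [pair for record in records for pair in map_phase(record)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In Hadoop the mapper and reducer run on many nodes and the framework performs the shuffle; the developer still supplies only the two functions.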
PIG for transforming unstructured data
■ Pig is a platform whose scripting language, Pig Latin, processes unstructured datasets ("pigs eat
anything")
– Contrast with Hive
■ Pig Latin programs run in a distributed fashion on a cluster (programs are compiled
into Map/Reduce jobs and executed using Hadoop).
Apache Hive
■ Provides SQL like access to structured HDFS datasets
– Contrast with Pig
■ Queries are converted to internal M/R, Tez or Spark jobs (similar to Pig)
■ Indexing is supported
■ CRUD support with ACID functionality was added
■ Query language can be extended with User Defined Functions (UDFs)
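The combination of SQL-like queries and user-defined functions can be shown in miniature with Python's built-in `sqlite3` module. SQLite is purely a stand-in here: HiveQL syntax differs in places and Hive UDFs are registered differently. The `users` table and the `domain_of` function are hypothetical.

```python
import sqlite3

def domain_of(email):
    """User-defined function: extract the domain part of an e-mail address."""
    return email.split("@", 1)[1]

conn = sqlite3.connect(":memory:")
conn.create_function("domain_of", 1, domain_of)  # register the UDF with the SQL engine
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("c@example.org",)],
)
rows = conn.execute(
    "SELECT domain_of(email), COUNT(*) FROM users "
    "GROUP BY domain_of(email) ORDER BY domain_of(email)"
).fetchall()
```

The query author works in SQL and calls the UDF like any built-in function; the engine decides how to execute it.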
Mahout – Machine Learning
■ Mahout is a library of scalable machine-learning algorithms, implemented on top of
Apache Hadoop® and using the MapReduce paradigm.
■ Mahout supports four main data science use cases:
– Collaborative filtering – mines user behavior and makes product
recommendations (e.g. Amazon recommendations)
– Clustering – takes items in a particular class (such as web pages or
newspaper articles) and organizes them into naturally occurring groups, such
that items belonging to the same group are similar to each other
– Classification – learns from existing categorizations and then assigns
unclassified items to the best category
– Frequent itemset mining – analyzes items in a group (e.g. items in a
shopping cart or terms in a query session) and then identifies which items
typically appear together
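Of the four use cases, frequent itemset mining is the easiest to show in a few lines. The sketch below counts item pairs by brute force, which is far simpler than the algorithms Mahout ships; the baskets and the `min_support` threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least `min_support` baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Brute-force pair counting grows quadratically with basket size, which is one reason to use a distributed library for real workloads.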
Publishing Results
HBase – Hadoop’s NoSQL DB
■ HBase provides near real-time, random read and write access to tables (or to be
more accurate ‘maps’) storing billions of rows and millions of columns.
– Contrast with Cassandra
■ HBase runs on the Hadoop cluster without the need for additional cluster
deployments
■ Access is dependent on the availability of the cluster
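The remark that HBase tables are more accurately "maps" can be made concrete with a toy model: a map from row key to a sparse map of columns, scanned in row-key order. `SparseTable`, the row keys and the column names are hypothetical; the sketch leaves out versions, regions and persistence.

```python
from collections import defaultdict

class SparseTable:
    """Toy model of a wide-column table: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; rows need not share the same columns."""
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Random read of one cell, or None if it is absent."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key, in key order."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])

table = SparseTable()
table.put("host#web01", "stats:errors", 3)
table.put("host#web02", "stats:errors", 0)
table.put("host#web02", "meta:owner", "ops")
scanned = list(table.scan("host#web01", "host#web02"))
```

Because rows are ordered by key, row-key design largely determines which range scans are cheap.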
Apache Cassandra – distributed DB
■ Apache Cassandra is a massively scalable open source non-relational database that
offers continuous availability, linear scale performance, operational simplicity and
easy data distribution across multiple data centers and cloud availability zones.
– Contrast with HBase
■ Hadoop deployments are tied to the data center. Replicate the results to multiple
sites via the eventual consistency of Cassandra.
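Eventual consistency across sites can be sketched with two dictionaries that exchange timestamped writes and keep the newest one. Last-write-wins by timestamp is a simplified model of how conflicting writes are reconciled; the replicas, the key and the timestamps are invented for this sketch.

```python
def write(replica, key, value, timestamp):
    """Apply a write unless the replica already holds a newer value (last write wins)."""
    current = replica.get(key)
    if current is None or timestamp > current[1]:
        replica[key] = (value, timestamp)

def sync(source, target):
    """Copy every cell from `source` to `target`, resolving conflicts by timestamp."""
    for key, (value, timestamp) in source.items():
        write(target, key, value, timestamp)

site_a, site_b = {}, {}
write(site_a, "report:2016-03-01", "v1", timestamp=1)
write(site_b, "report:2016-03-01", "v2", timestamp=2)   # later write at the other site
sync(site_a, site_b)
sync(site_b, site_a)
```

Between the two `sync` calls the sites briefly disagree; after both have run they hold the same value, which is the "eventual" part.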
Backup
Phases of Data Ingestion
Architecting Big Data Ingest & Manipulation

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 

Architecting Big Data Ingest & Manipulation

  • 1. BIG DATA INGESTION & MANIPULATION KW Big Data Peer2Peer http://www.meetup.com/KW-Big-Data-Peer2Peer/ Presented by George Long
  • 2. Who Am I – George Long ■ Software Architect living in KW area ■ 3 decades of software engineering in UK and North America – Speciality is distributed systems design with Big Data, Cloud and NoSQL technologies ■ 5th degree black belt in taekwondo ■ Email : master.geo.san@gmail.com ■ Linkedin : https://ca.linkedin.com/in/mastergeosan
  • 3. Overview ■ What does it actually take to get the data you need to add that one metric to your report/dashboard? ■ What's it like to navigate the early conversations of an analytics solution? ■ How is one technology selected over another and how do those selections impact or define other selections?
  • 4. Agenda ■ Setting the solution space ■ Sample big data use cases ■ Towards a Big Data Culture ■ Hadoop Tools – ETL tools for ingest – Tools for data manipulation – Publishing Results
  • 6. Ingest Use Case Prerequisites ■ What is your use case – what are you trying to do? – Are there multiple asks? – Who are the end-users? ■ Is this a new ask or refinements to existing workflows? – How responsive is your organisation to change? – Is there an existing team to manage the solution? ■ How will the data source yield their information? – What are the network protocols? Frequency, Volume etc. ■ Is the data structured or unstructured? – Are the formats stable? ■ Do you understand the data life cycle of your data? – Retention policies, privacy, access control ■ Performance & Availability – How quickly are results required? – What are the tolerances for system failure? SLAs?
  • 7.
  • 9. 5 Vs of Big Data Source: http://iihtofficialblog.blogspot.ca/2014/07/5-vs-of-hadoop-big-data.html
  • 10. Sample Big Data Use Cases
  • 11. UC1-LOG – Server log analysis ■ Analysis of data-at-rest ■ A massive volume of logs is captured by syslog and requires aggregation by hour/day ■ Results are produced every hour
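The hourly aggregation in UC1-LOG can be sketched in a few lines of Python. This is only an illustration, assuming each log line starts with an ISO 8601 timestamp followed by a host name; the function name `hourly_counts` and the sample lines are hypothetical and do not come from the slides.

```python
from collections import Counter


def hourly_counts(lines):
    """Count log lines per (hour, host) bucket.

    Assumes lines look like '2015-11-03T14:05:12Z host message...'.
    """
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        timestamp, host = parts[0], parts[1]
        hour = timestamp[:13]  # keep 'YYYY-MM-DDTHH' as the bucket key
        counts[(hour, host)] += 1
    return counts


# Hypothetical sample data for illustration only.
sample = [
    "2015-11-03T14:05:12Z web01 sshd: accepted connection",
    "2015-11-03T14:47:30Z web01 nginx: GET /index.html",
    "2015-11-03T15:01:02Z web02 nginx: GET /health",
]
result = hourly_counts(sample)
```

On a real cluster the same grouping would be expressed as a MapReduce, Pig or Hive job over files in HDFS; the bucket key is the part that carries over.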
  • 12. UC2-MSTR – Monitoring Streaming Events ■ Analysis of data-in-motion ■ Extends UC1-LOG by requiring that certain log events be used to signal service impact – i.e. generate actionable events from real-time (RT) analysis of the service logs ■ Results are produced continuously
  • 13. UC3-PPD – Publishing Production Data ■ Refining data for hosting by services ■ Datasets are massaged and published for consumption by customer facing services ■ Data is merged, refined and published for service hosts to consume ■ Process repeats on demand to accommodate new datasets
  • 14. Towards a Big Data Culture
  • 15. Human Side of Analytics
  • 18. Hadoop ECOSYSTEM is evolving Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/
  • 19. ETL Tools for Ingest
  • 20. File transfer to HDFS ■ Simple file loads, via the following techniques – Explicit loading via HDFS commands, e.g. ■ hadoop fs -put <file> <hdfs-dir> – Mounting HDFS as a FUSE-enabled filesystem ■ Note that the filesystem supports append-only writes ■ Note - Manually loaded filesets require manual tracking and clean-up
  • 21. DB Exchange with Hadoop - Apache Sqoop ■ Apache Sqoop is a tool for transferring data between Hadoop and relational databases. Use Sqoop to import data from a MySQL or Oracle database into HDFS, run MapReduce on the data, and then export the data back into an RDBMS. Sqoop automates these processes, using MapReduce to import and export the data in parallel with fault tolerance ■ It offers two-way replication with both snapshots and incremental updates. ■ Note - Sqoop requires detailed schema knowledge and synchronised configuration of database accounts. The database must be up while Sqoop runs.
  • 22. Log Collection – Apache Flume ■ Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. ■ Maintains a central list of ongoing data flows, stored redundantly in ZooKeeper ■ See : http://www.lopakalogic.com/articles/hadoop-articles/log-files-flume-hive/
  • 23. Queue Ingest - Apache Kafka ■ Apache Kafka is a fast, distributed publish-subscribe messaging system. It is designed to provide high throughput persistent messaging that’s scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize IO performance and mirroring to improve availability, scalability and to optimize performance in multiple-cluster scenarios. – Queues decouple systems: Both statically and in time ■ See http://www.slideshare.net/gwenshap/kafka-for-dbas ■ Note - Kafka can buffer source data so the availability of the Hadoop platform can be relaxed
  • 24. RT Streaming – Storm or Spark ■ Both Storm and Spark Streaming are open-source frameworks for distributed stream processing ■ Processing Model, Latency – Storm processes incoming events one at a time in real time – Spark Streaming batches up events that arrive within a short time window before processing them, which adds several seconds of latency ■ Fault Tolerance, Data Guarantees – Storm tracks individual records and guarantees that each record will be processed at least once, but allows duplicates – Spark Streaming provides better support for stateful computation that is fault tolerant.
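The difference between the two processing models on slide 24 can be illustrated without either framework. The sketch below is plain Python, not the Storm or Spark Streaming API; the function names, the event layout (`ts`, `level`) and the 2-second window are assumptions made for the example.

```python
def process_per_event(events, handler):
    """Storm-style model: hand each event to the handler as soon as it arrives."""
    return [handler([event]) for event in events]


def process_micro_batches(events, window_seconds, handler):
    """Spark Streaming-style model: group events into fixed time windows,
    then hand each window to the handler as one batch."""
    batches = {}
    for event in events:
        window = int(event["ts"] // window_seconds)
        batches.setdefault(window, []).append(event)
    return [handler(batches[w]) for w in sorted(batches)]


def count_errors(batch):
    """Example handler: number of ERROR events in a batch."""
    return sum(1 for e in batch if e["level"] == "ERROR")


# Hypothetical events; 'ts' is seconds since the stream started.
events = [
    {"ts": 0.2, "level": "ERROR"},
    {"ts": 0.9, "level": "INFO"},
    {"ts": 1.4, "level": "ERROR"},
    {"ts": 3.1, "level": "ERROR"},
]
per_event = process_per_event(events, count_errors)         # one result per event
per_batch = process_micro_batches(events, 2, count_errors)  # one result per window
```

The per-event path yields a result for every event immediately, while the micro-batch path yields one result per window, so its latency is bounded below by the window length.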
  • 25. Lambda Architecture ■ Batch layer, which holds all the processed batch data from the past ■ Speed layer, or RT feed of similar or the same information ■ Serving layer combines the two for transparent access
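The batch, speed and serving layers on slide 25 are commonly described as the Lambda architecture. A minimal sketch of the serving layer's merge step is shown here, assuming both layers expose per-key counts as plain dictionaries; the `serve` function and the sample views are hypothetical.

```python
from collections import Counter


def serve(batch_view, speed_view):
    """Serving layer: combine the precomputed batch view with the
    recent increments held by the speed layer."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # adds counts key by key
    return dict(merged)


# Hypothetical views: request counts per host.
batch_view = {"web01": 1200, "web02": 950}  # everything up to the last batch run
speed_view = {"web02": 7, "web03": 3}       # events seen since that run
combined = serve(batch_view, speed_view)
```

This only works cleanly when the metric is additive. Once a batch run absorbs the recent events, the matching entries have to be dropped from the speed view so they are not counted twice.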
  • 26. Tools for data manipulation
  • 27. Java MR (Map/Reduce) ■ Map/Reduce functionality is accessible via Java. ■ Full applications can be developed, although higher-level constructs such as Pig and Hive should also be considered, as traditional Java development cycles are usually longer than the scripting routes.
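The Map/Reduce model itself is small enough to sketch in plain Python. The example below is a single-process word count that mimics the map, shuffle and reduce phases; it is not the Hadoop Java API, and the function names are chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would
    do between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


word_counts = run_job(["pigs eat anything", "Pigs eat"])
```

In Hadoop the map and reduce functions run on many nodes and the shuffle moves data across the network, but the contract between the phases is the same.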
  • 28. PIG for transforming unstructured data ■ Pig is a scripting language (Pig Latin) for processing unstructured datasets. (pigs eat anything) – Contrast with HIVE ■ Pig Latin programs run in a distributed fashion on a cluster (programs are compiled into Map/Reduce jobs and executed using Hadoop).
  • 29. Apache Hive ■ Provides SQL-like access to structured HDFS datasets – Contrast with Pig ■ Queries are converted to internal M/R, Tez or Spark jobs (similar to Pig) ■ Indexing is supported ■ CRUD support with ACID functionality was added ■ Query language can be extended with User Defined Functions (UDFs)
  • 30. Mahout – Machine Learning ■ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop®  and using the MapReduce paradigm. ■ Mahout supports four main data science use cases: – Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations) – Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other – Classification – learns from existing categorizations and then assigns unclassified items to the best category – Frequent itemset mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together
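To make the frequent itemset idea concrete, here is a naive pair-counting sketch in Python. It is a simple stand-in chosen for illustration, not Mahout's implementation and not a scalable algorithm; `frequent_pairs`, `min_support` and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping carts.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Counting every pair grows quadratically with basket size, which is why large datasets call for a distributed library rather than this brute-force approach.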
  • 32. HBase – Hadoop’s NoSQL DB ■ HBase provides near real-time, random read and write access to tables (or to be more accurate ‘maps’) storing billions of rows and millions of columns. – Contrast with Cassandra ■ HBase runs on the Hadoop cluster without the need for additional cluster deployments ■ Access is dependent on the availability of the cluster
  • 33. Apache Cassandra – distributed DB ■ Apache Cassandra is a massively scalable open source non-relational database that offers continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones. – Contrast with HBase ■ Hadoop deployments are tied to the data center. Replicate the results to multiple sites via the eventual consistency of Cassandra.
  • 35. Phases of Data Ingestion

Editor's Notes

  1. November 3rd, 2015
  2. Source: http://itblog.emc.com/2013/08/13/the-big-data-wilderness-finding-your-way-starts-with-asking-the-right-questions/
  3. http://www.b-eye-network.com/blogs/eckerson/archives/hadoop_and_nosq/
  4. Source: http://xinhstechblog.blogspot.ca/2014/06/storm-vs-spark-streaming-side-by-side.html
  5. Source : https://voltdb.com/blog/simplifying-complex-lambda-architecture https://www.youtube.com/watch?v=rE0KGHbh7ZQ
  6. Source : https://www.datatorrent.com/dtingest-arrival-scalable-fault-tolerant-bigdata-ingestion/