Big data landscape


An overview of several technologies that contribute to the Big Data landscape.

An introduction to the technology challenges of Big Data, followed by key open-source components that help with various big data aspects such as OLAP, real-time online analytics, and machine learning on Map-Reduce. I conclude with an enumeration of the key areas where these technologies are most likely to unleash new opportunities for various businesses.

Published in: Technology

  1. Big data: The technology landscape and its applications. Natalino Busa - 12 Feb. 2013
  2. Outline
     ● Big Data: Who art thou?
     ● Big Data: The technology landscape
     ● Hadoop: Overview
     ● Analytics & Machine Learning
     ● Opportunities
  3. Hype cycle on new IT technologies (Gartner, 2012)
  4. What is big data?
     DATA (structured and unstructured: logs, ETL, social)
     Velocity, Diversity, Volume
     BIG DATA: Hardware, Software, Services
     Diagram labels: Infrastructure, Marketing (e.g. Unica), RDBMS, (Private) Cloud, Analytics (Tableau), OLAP, Networking, Modeling (SAS), Messaging
  5. Big Data Heat map
  6. How big is big?
     SkyTree (tm) defines the Analytics Requirements Index (ARI):
       ARI = (# Rows × # Columns) / Time (secs)
     where
       # Rows = number of records being analyzed
       # Columns = number of variables captured in each record
       Time (secs) = the timeframe within which to complete the analysis
     Example: for each view (1000 views/sec), produce a personalized banner.
     I need to analyze 100 variables on 1000 records (historic data) every 1 ms:
       ARI = (1000 × 100) / 0.001 = 100 M values/sec
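
The ARI arithmetic on slide 6 can be checked with a few lines of Python. This is only a sketch of the formula as stated on the slide; the function name `analytics_requirements_index` and its parameter names are illustrative assumptions, not part of any SkyTree API.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return ARI = (rows * columns) / seconds, i.e. values to analyze per second.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# Slide example: 100 variables on 1000 historic records, every 1 ms.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"{ari / 1e6:.0f} M values/sec")
```

Halving the time budget or doubling the number of variables doubles the ARI, which is why per-request personalization quickly becomes a throughput problem rather than a storage problem.
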
  7. What data?
     Big Data can imply:
     ● Complex data refactoring in batch (lots of rows)
     ● Real-time event processing (high-speed responses)
     ● Multidimensional analysis (lots of parameters)
     ● ... or any combination of those three
     Diagram axes: Response time, Parameters, Entities
  8. More data
     customers
     customers + products
     customers + products + surveys
     customers + products + surveys + transactions
     customers + products + surveys + transactions + social messages
     Structured -> Unstructured: Database, Databases, Federated Data, Aggregated Data, Linked Data, Just Data
     ● In today's IT environments there is a gradual shift from structured data to unstructured data.
     ● RDBMSs are well suited to deal with structured data -> but: more, and more complex, ETL; how to deal with new data (structures)?
     ● Map-Reduce and NoSQL systems are good with unstructured data -> but: how do we query and analyze this data?
  9. Big Data: how to deal with it
     ● Big Data at rest (storage, access)
     ● Big Data in motion (streaming, dataflows)
     ● Big Data analytics (OLAP, OTAP, BI)
     ● Big Data modeling (predictive, machine learning)
  10. Big Data at rest
      Analytical RDBMSs (EDW): Oracle, IBM, and various MPPs
      Hadoop Distributed Systems: HDFS (distributed file system), HBase (Big Table)
      Diagram labels: Batch, Real-time, Cassandra, HBase, Analytics, Logs, HDFS, EDW
      ● Traditional EDW and distributed BigData / NoSQL solutions are complementary to each other.
      ● These systems do not exclude each other and can coexist to form a full enterprise-level solution.
  11. Big Data at rest
      No need to get everything out of the Hadoop ecosystem.
      NoSQL DBMSs:
      - Couchbase (++ reads, caching)
      - Cassandra (++ writes, OLAP)
      ... hybrid solutions are also possible:
      - HDFS + Cassandra: in-memory analytics + large DFS
      - HDFS + Solr/Lucene: fast text search on a distributed file system
  12. Big Data in motion
      Stream processing // Dataflow architectures
      Used to support the automatic analysis of data in motion, in real time or near real time:
      - Identify meaningful patterns
      - Trigger actions to respond to them as quickly as possible
      Frameworks:
      - Storm (from Twitter): dataflow processing framework, ++ multi-language
      - Akka (from Typesafe): dataflow actor framework, ++ speed
      Both are: distributed, fault-tolerant, streaming
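
The "identify a pattern, then trigger an action" loop from slide 12 can be sketched in plain Python. This is not Storm or Akka code and uses neither API; it is a single-process illustration in which the event shape, the function `detect_bursts`, and the callback `on_pattern` are all hypothetical names chosen for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event key); assumed shape


def detect_bursts(events: Iterable[Event], window_secs: float, threshold: int,
                  on_pattern: Callable[[str, float], None]) -> None:
    """Call on_pattern(key, ts) when a key occurs >= threshold times within window_secs."""
    recent: Dict[str, Deque[float]] = {}
    for ts, key in events:
        times = recent.setdefault(key, deque())
        times.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while times and ts - times[0] > window_secs:
            times.popleft()
        if len(times) >= threshold:
            on_pattern(key, ts)  # the "trigger action" step


alerts: List[Tuple[str, float]] = []
stream = [(0.0, "login_failed"), (0.2, "page_view"), (0.5, "login_failed"),
          (0.9, "login_failed"), (5.0, "login_failed")]
detect_bursts(stream, window_secs=1.0, threshold=3,
              on_pattern=lambda key, ts: alerts.append((key, ts)))
print(alerts)
```

Each event is handled once as it arrives and only a small window of state is kept per key. A distributed framework would add partitioning of keys across workers and fault tolerance on top of this basic idea.
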
  13. Big Data Landscape (diagram)
      Machine Learning on Big Data: SAS, R over HDFS, Mahout
      Diagram labels: Unstructured, REST, Logs, Data Interfaces, flume, scribe, sqoop, hiho, HDFS, MapR, Hbase, Cassandra, Hive, Pig, Impala, STORM, EDW, Unstructured FS, OLAP, OTAP, BI
      ● Batch Analytics ● Visualization ● Monitoring ● Marketing ● Real-Time Analytics ● Streaming
  14. Lambda Architecture
      Logic layer: Software as a Service, e.g. real-time predictor
  15. Why do machine learning on big data
  16. Machine Learning: What?
      SIMILARITY SEARCH
      Similarity search provides a way to find the objects that are the most similar, in an overall sense, to the object(s) of interest.
      PREDICTIVE ANALYTICS
      Predictive analytics is the science of analyzing current and historical facts/data to make predictions about future events.
      CLUSTERING AND SEGMENTATION
      Cluster analysis and segmentation represent a purely data-driven approach to grouping similar objects, behaviors, or whatever is represented by the data.
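
As a small illustration of the similarity search idea on slide 16, the sketch below ranks objects by cosine similarity to a query vector. Cosine similarity is one common choice of measure and is an assumption here, since the slide does not name one; the `catalog` data and the helper names are made up for the example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: List[float], catalog: Dict[str, List[float]],
                 k: int = 2) -> List[Tuple[str, float]]:
    """Return the k catalog entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical feature vectors, e.g. per-customer interest scores.
catalog = {
    "alice": [5.0, 1.0, 0.0],
    "bob": [4.0, 0.0, 1.0],
    "carol": [0.0, 5.0, 4.0],
}
top = most_similar([5.0, 0.0, 0.0], catalog, k=2)
print([name for name, _ in top])
```

This brute-force scan compares the query against every object, which is fine for a toy catalog. At big data scale the same ranking is usually approximated with indexing or distributed computation.
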
  17. Word Counting on Map Reduce
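
The slide's diagram did not survive extraction, so here is a minimal sketch of word counting in the Map-Reduce style, written in plain single-process Python rather than against the Hadoop API. The function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative assumptions that mirror the three conceptual stages.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["big data at rest", "big data in motion"])
print(counts)
```

Because each `map_phase` call sees only one document and each `reduce_phase` call sees only one key, both stages can run in parallel across machines. That independence is what a real Map-Reduce cluster exploits.
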
  18. Machine learning on Map Reduce
  19. Machine learning on Map Reduce
  20. Machine Learning: Use Cases
      E-Commerce / E-Tailing
      ● Product Recommendation Engines
      ● Cross Channel Analytics
      ● Events/Activity Behavior Segmentation
      Product Marketing
      ● Campaign management and optimization
      ● Market and consumer segmentations
      ● Pricing Optimization
      Customer Marketing
      ● Customer Churn Management
      ● (Mobile) User Behavior Prediction
      ● Offer Personalization
  21. Big Data: Opportunities
      Unstructured Data
      ● Clustering
      ● Distributed processing
      ● Distributed Storage
      Modeling & Analytics
      ● Distributed Machine Learning
      ● Fast Online Analytics Cubes
      Streaming and Real-Time processing
      ● Build RT profiles
      ● Decision trees and Predictions
      ● Offer Personalization
  22. Thanks
      linkedin:
      blog: