Big data landscape

An overview of several technologies that contribute to the Big Data landscape.

An introduction to the technology challenges of Big Data, followed by the key open-source components that help in dealing with various big data aspects such as OLAP, Real-Time Online Analytics, and Machine Learning on Map-Reduce. I conclude with an enumeration of the key areas where these technologies are most likely to unleash new opportunities for various businesses.

Transcript

  • 1. Big data: The technology landscape and its applications. Natalino Busa - 12 Feb. 2013
  • 2. Outline ● Big Data: Who art thou? ● Big Data: The technology landscape ● Hadoop: Overview ● Analytics & Machine Learning ● Opportunities
  • 3. Hype cycle on new IT technologies (Gartner 2012)
  • 4. What is big data? Data (structured and unstructured: logs, ETL, social), characterized by Velocity, Diversity, and Volume. Big Data spans Hardware, Software, and Services. Diagram labels: Infrastructure, Marketing (e.g. Unica), RDBMS, (Private) Cloud, Analytics (Tableau), OLAP, Networking, Modeling (SAS), Messaging.
  • 5. Big Data Heat map
  • 6. How big is big? SkyTree (tm) defines the Analytics Requirements Index (ARI):
    ARI = (# Rows × # Columns) / Time (secs)
    where # Rows = number of records being analyzed, # Columns = number of variables captured in each record, and Time (secs) = the timeframe within which to complete the analysis.
    Example: for each view (1000 views/sec), produce a personalized banner. I need to analyze 100 variables on 1000 records (historic data) every 1 ms: ARI = (1000 × 100) / 0.001 = 100 M values/sec.
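
The ARI arithmetic on slide 6 can be checked with a few lines of Python. This is only a sketch: the function name `analytics_requirements_index` and its parameter names are assumptions made for illustration and do not come from SkyTree.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return ARI = (rows * columns) / seconds, in values per second.

    Hypothetical helper name; the formula itself is the one quoted on the slide.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (rows * columns) / seconds


# The slide's banner example: 1000 records x 100 variables, analyzed every 1 ms.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"ARI = {ari:,.0f} values/sec")
```

Halving the time budget or doubling the number of variables doubles the index, which is why real-time personalization lands in "big data" territory even with modest record counts.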
  • 7. What data? Big Data can imply: ● Complex data refactoring in batch (lots of rows) ● Real-time event processing (high-speed responses) ● Multidimensional analysis (lots of parameters) ● ... or any of those three. Diagram axes: Response time, Parameters, Entities.
  • 8. More data. The data set keeps growing: customers; customers + products; customers + products + surveys; customers + products + surveys + transactions; customers + products + surveys + transactions + social messages.
    Diagram labels, from structured to unstructured: Database, Databases, Federated Data, Aggregated Data, Linked Data, Just Data.
    ● In today's IT environments there is a gradual shift from structured data to unstructured data.
    ● RDBMSs are well suited to deal with structured data, but: more, and more complex, ETL; how to deal with new data (structures)?
    ● Map-Reduce and NoSQL systems are good with unstructured data, but: how do we query and analyze this data?
  • 9. Big Data: how to deal with it ● Big Data at rest (storage, access) ● Big Data in motion (streaming, dataflows) ● Big Data analytics (OLAP, OTAP, BI) ● Big Data modeling (predictive, machine learning)
  • 10. Big Data at rest.
    Analytical RDBMSs (EDW): Oracle, IBM, and various MPPs.
    Hadoop distributed systems: HDFS (distributed file system), HBase (Big Table).
    Diagram labels: Batch, Real-time, Cassandra, HBase, Analytics, Logs, HDFS, EDW.
    ● Traditional EDW and distributed Big Data / NoSQL solutions are complementary to each other.
    ● These systems do not exclude each other and can coexist to form a full enterprise-level solution.
  • 11. Big Data at rest. No need to get everything out of the Hadoop ecosystem.
    NoSQL DBMSs: Couchbase (++ reads, caching), Cassandra (++ writes, OLAP).
    Hybrid solutions are also possible:
    - HDFS + Cassandra: in-memory analytics + large DFS
    - HDFS + Solr/Lucene: fast text search on a distributed file system
  • 12. Big Data in motion: stream processing / dataflow architectures. Used to support the automatic analysis of data in motion in real time or near real time:
    - Identify meaningful patterns
    - Trigger actions to respond to them as quickly as possible
    Frameworks:
    - Storm (from Twitter): dataflow processing framework, ++ multi-language
    - Akka (from Typesafe): dataflow actor framework, ++ speed
    Both are distributed, fault-tolerant, and streaming.
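
The "identify a pattern, then trigger an action" loop from slide 12 can be sketched in plain Python. This is a single-process illustration only, not Storm or Akka code: the event shape, the function name `detect_bursts`, and the burst rule (at least `threshold` events from one user within `window_secs`) are all assumptions chosen for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Tuple

# Hypothetical event shape: (timestamp in seconds, user id).
Event = Tuple[float, str]


def detect_bursts(
    events: Iterable[Event],
    window_secs: float,
    threshold: int,
    on_burst: Callable[[str, float], None],
) -> None:
    """Consume events one by one and call on_burst when a user is 'bursting'."""
    recent: Dict[str, Deque[float]] = {}
    for ts, user in events:
        window = recent.setdefault(user, deque())
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > window_secs:
            window.popleft()
        # Pattern found: react immediately, then start counting afresh.
        if len(window) >= threshold:
            on_burst(user, ts)
            window.clear()


alerts = []
stream = [(0.0, "a"), (0.2, "a"), (0.3, "b"), (0.4, "a"), (5.0, "a")]
detect_bursts(stream, window_secs=1.0, threshold=3,
              on_burst=lambda user, ts: alerts.append((user, ts)))
print(alerts)
```

A distributed framework adds what this sketch leaves out: partitioning the stream across machines and recovering the per-user state after a failure.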
  • 13. Big Data Landscape (diagram). Labels: Machine Learning on Big Data; Unstructured; SAS, R over HDFS; Mahout; REST; Logs; flume; HBase; Hive; Data Interfaces; scribe; Batch Analytics; HDFS; Visualization; MapR; BI; Monitoring; Marketing; sqoop; Cassandra; Pig; EDW; hiho; Unstructured FS; OLAP; OTAP; Impala; Real-Time Analytics; Streaming; STORM.
  • 14. Lambda Architecture. Logic layer; Software as a Service, e.g. real-time predictor. From http://www.manning.com/marz/
  • 15. Why do machine learning on big data http://www.skytree.net/why-do-machine-learning-on-big-data/
  • 16. Machine Learning: What?
    ● SIMILARITY SEARCH: Similarity search provides a way to find the objects that are the most similar, in an overall sense, to the object(s) of interest.
    ● PREDICTIVE ANALYTICS: Predictive analytics is the science of analyzing current and historical facts/data to make predictions about future events.
    ● CLUSTERING AND SEGMENTATION: Cluster analysis and segmentation represent a purely data-driven approach to grouping similar objects, behaviors, or whatever is represented by the data.
    From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
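
As a small illustration of the similarity search described on slide 16, the sketch below ranks items by cosine similarity to a query vector. Cosine similarity is just one possible measure, picked here as an assumption; the slide does not name one. The function names and the toy feature vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query: Sequence[float],
    items: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k items most similar to the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy catalogue: each product is described by two made-up features.
catalogue = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(most_similar([1.0, 0.1], catalogue, k=2))
```

This brute-force scan compares the query against every item, which is exactly the cost that grows with the number of rows and columns in the data set.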
  • 17. Word Counting on Map Reduce
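
The slide image for word counting is not part of this transcript, so the following is a generic sketch of the classic example rather than the author's diagram. It runs the three Map-Reduce phases (map, shuffle, reduce) in a single Python process; the function names are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())


print(word_count(["big data", "Big opportunities"]))
```

On a cluster, the map and reduce functions stay this small; the framework takes over splitting the input, shuffling pairs between machines, and retrying failed tasks.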
  • 18. Machine learning on Map Reduce From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
  • 19. Machine learning on Map Reduce From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
  • 20. Machine Learning: Use Cases
    E-Commerce / E-Tailing: ● Product Recommendation Engines ● Cross Channel Analytics ● Events/Activity Behavior Segmentation
    Product Marketing: ● Campaign management and optimization ● Market and consumer segmentations ● Pricing Optimization
    Customer Marketing: ● Customer Churn Management ● (Mobile) User Behavior Prediction ● Offer Personalization
  • 21. Big Data: Opportunities
    Unstructured Data: ● Clustering ● Distributed processing ● Distributed Storage
    Modeling & Analytics: ● Distributed Machine Learning ● Fast Online Analytics Cubes
    Streaming and Real-Time processing: ● Build RT profiles ● Decision trees and Predictions ● Offer Personalization
  • 22. Thanks linkedin: www.linkedin.com/in/natalinobusa blog: www.natalinobusa.com