20131118 - Seoul - Advanced Computing Conference 10 - New Trends In Hadoop
 

http://www.zdnet.co.kr/news/news_view.asp?artice_id=20131119145440


Usage Rights

© All Rights Reserved

  • Add your contact information, including a “MapR Job Title” starting with the word “MapR”, such as MapR Solutions Architect. Add the #hashtag of the particular meet-up and choose one or more of the others, BUT NOT ALL OF THEM.
  • Reduce cost and time
  • This slide and the following demos set the stage to introduce the overall big idea of the talk: there are many cool advances in these different areas, but without the ideas introduced here, it’s hard to connect these technologies together easily and reliably.
  • Gives up random access read on files; gives up strong authentication / authorization model; gives up random access write / append on files
  • http://trends.truliablog.com/vis/metro-movers/
  • Main point: these applications can be served directly from the MapR filesystem because it presents as a POSIX filesystem using the NFS protocol. Non-big-data applications like Node.js and D3.js can work directly off of MapR, and don’t care that it is also a highly scalable DFS.
  • Talk track: Let’s look at another example of new technologies on Hadoop in more detail. Real-time analysis has many uses, such as sentiment analysis. What is sentiment analysis and why would you use it? (then start into slide details)
  • Talk track: This diagram shows what you are doing at each step in the process. Of course the final steps are interpretation of the output of your application by humans to gain insight that will power business decisions.
  • Allen: This is the MapR view. You can do it first or the non-mapr view first. I think you said you preferred to do non-mapR first?
  • Note to speaker: listen to explanation at about 23min into Ted’s video for this slide and next slide showing the sliding window
  • Note to speaker: here is the option if you don’t have a distributed system with a real-time replicated filesystem (such as what MapR has). The alternative is to use something like Kafka, which is much harder to build, etc.
  • “redundant redundancy” is about too many data copies = inefficient design and risk of inconsistency between systems
  • Allen: Is this useful? Is it MapR specific?
  • What is NAS? Should the little elephant say MapR? Remember this slide was from a sales pitch, so you may have to make clear it’s talking about MapR
  • Allen: Do we want to use the images including Korean bowl or better to avoid possible negative reaction and just use a text slide for transition to thinking about legacy applications?
  • Another example of the same thing (this slide can be suppressed – just another view of the inaccuracies)
  • I hid this one and went with the other format of your slide.
  • Again, I’ve hidden this version and gone with the one that is formatted differently. Same content, though.
  • Talk track: In this presentation, we are going to look in more detail at the tools and technologies used for this middle stage, which is done by machine: the processing and visualization/reporting steps… (Allen: don’t know if you want to say more details here, but I suspect this slide is just a transition to get the audience focused on the processing & the viz/reporting part of the work flow.)
  • Note to speaker: Slide just sets up the transition from business goals to the architecture diagram slide. Don’t need a lot of detail, but introduce Storm, say a little about the project and that it is new to Apache, AND MENTION that MAPR’s Ted Dunning is one of the PROJECT MENTORS. Talk track: Now that you see the business goal of using real-time sentiment analysis, let’s look at the architecture for a sample project… Here are some technologies you’ll need… [explain] and then say “and on the next slide we see a diagram of the architectural design” (or call it work flow?)
  • Allen: This is the MapR view. You can do it first or the non-mapr view first. DOES THE WEB SERVER/WEB DATA ALSO COME OFF if not a MapR cluster?
  • The distinction is not clear to me… Attempt 2 meaning different slide to show same idea? Confusing to mix symbols of actions/ ideas with components in work flow
  • This is my start on the non-MapR view
  • Do you want to include this information?
  • MapR enables integration by providing industry-standard interfaces. More 3rd party solutions work with MapR than any other distribution. Proprietary connectors not needed.
    – NFS: All file-based applications can read and write data. Examples: Linux utilities, file browsers, Informatica UltraMessaging
    – ODBC 3.52: All BI applications can leverage Hive. Examples: Excel, Crystal Reports, Tableau, MicroStrategy
    – Linux PAM: Any authentication provider can be used. Examples: LDAP, Kerberos, 3rd party
  • Note from Ted’s video hints: HBase storage here can be handy but is much more of a sideline to the main idea.
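
The speaker notes mention a sliding window, and the demo slide says the application "remembers the top few words" from tweets as they arrive. The following is a minimal sketch of that idea in plain Python, not the code used in the talk: the class name `SlidingTopWords`, the window size, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter, deque


class SlidingTopWords:
    """Track the most frequent words over the last `window` tweets (hypothetical helper)."""

    def __init__(self, window=1000, top_n=10):
        self.window = window
        self.top_n = top_n
        self.recent = deque()     # word lists of the tweets currently inside the window
        self.counts = Counter()   # word frequencies over those tweets

    def add(self, tweet):
        # Naive tokenization: lowercase and split on whitespace.
        words = tweet.lower().split()
        self.recent.append(words)
        self.counts.update(words)
        if len(self.recent) > self.window:
            # The oldest tweet slides out of the window, so undo its counts.
            expired = self.recent.popleft()
            self.counts.subtract(expired)
            # In-place addition of an empty Counter drops zero and negative entries.
            self.counts += Counter()

    def top(self):
        return self.counts.most_common(self.top_n)
```

Keeping the state this small is in the spirit of the "keep it really simple" design tip: a restart only needs to replay the most recent `window` tweets to rebuild the counts.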

20131118 - Seoul - Advanced Computing Conference 10 - New Trends In Hadoop Presentation Transcript

  • 1. Making Hadoop Work for Everybody ©MapR Technologies - Confidential 1
  • 2. Making Hadoop Work for Everybody
    Allen Day, MapR Technologies, Principal Data Scientist
    – Contact:
        – Email: allenday@maprtech.com
        – Twitter: @allenday
    – Slides:
        – http://slideshare.net/allenday
    – Hash tags: #
  • 3. Hadoop adoption is widespread. What happens next?
  • 4. Big Data Trends: Where Are We Going?
    – Big Data Storage
    – Big Data Applications
        – don’t just store data but mine it to extract the full benefits.
        – Real-time processing requirements make low latency important
        – Many exciting new technologies are available
    – Going from academic to practical
        – take machine learning & advanced analytics from the research lab into business environments.
    – “Simple algorithms and lots of data trump complex models” [1]
        – elaborate designs may not give the best business benefits in production. Simplicity is the key to success.
    – Re-usability shortens time-to-market
        – use pre-existing components and familiar architectural design patterns to reduce development cost and time
    [1] Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems
  • 5. Big Data Trends: Who is Driving?
    – Web camp
        – everything is a service with a URL or a DOM
    – Big data camp
        – non-traditional scalable file systems
    – Everybody else
        – files and databases
    But… They don’t easily work together
  • 6. This is not a problem. It’s an opportunity
  • 7. New Technologies: Can They Play Together?
    Examples of excellent modern technologies
    – d3.js [1] does real-time, interactive visualization for excellent images of data
    – node.js [2] allows simple (not just web) servers
    – Apache Storm [3] does real-time processing
    – Hadoop does big data distributed storage really well
    But HDFS makes Hadoop stand somewhat alone
    – Special steps are needed to ingest and access data on a Hadoop cluster
    MapR has changed that . . .
    [1] http://d3js.org [2] http://nodejs.org [3] http://incubator.apache.org/storm
  • 8. Evolution of Data Storage
    Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
    (Diagram labels: Scalability; Functionality; Compatibility; Linux; POSIX)
  • 9. Evolution of Data Storage
    Hadoop achieves much higher scalability by trading away essentially all of this compatibility
    (Diagram labels: Scalability; Functionality; Compatibility; Linux; POSIX; Hadoop)
  • 10. Evolution of Data Storage
    MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance
    (Diagram labels: Scalability; Functionality; Compatibility; Linux; POSIX; Hadoop)
  • 11. MapR Data Storage: How it’s done
    (Diagram labels: HBase NoSQL Tables API; POSIX; NFS; Hadoop HDFS API; Apache HBase; MapR Filesystem; Apache Hadoop HDFS; arrows labeled “implements” and “depends”)
  • 12. MapR Data Storage: How it’s done
    Vertical Integration = High Performance
    (Diagram labels: HBase NoSQL Tables API; POSIX; NFS; Hadoop HDFS API; Apache HBase; MapR Filesystem; Apache Hadoop HDFS; arrows labeled “implements” and “depends”)
  • 13. Hadoop on MapR No Longer Stands Apart
    (Diagram labels: Legacy code & applications; New technologies: d3, node.js, Apache Storm; Multiple types of data sources; New custom applications; MapR cluster)
  • 14. What does this compatibility mean for you?
  • 15. Example: visualization of big data
  • 16. Visualization Gives Data Impact
    New technologies include visualization tools like D3.js [1] and Node.js [2]
    – POSIX tools that run and scale easily on MapR
    – http://trends.truliablog.com/vis/metro-movers/
    [1] http://d3js.org/ [2] http://nodejs.org
  • 17. Visualization Gives Data Impact
    New technologies include visualization tools like D3.js [1] and Node.js [2]
    – POSIX tools that run and scale easily on MapR
    – http://trends.truliablog.com/vis/metro-movers/
    [1] http://d3js.org/ [2] http://nodejs.org
  • 18. Example: real-time on Hadoop
  • 19. Sentiment Analysis In Real-time
    – Business Goal: Who is having a bad experience with my brand and how can I fix it?
    – What does the result look like?
        – Show me now
        – On Twitter
        – Who it is…
        – …how they feel
        – …and what product/service they’re interacting with
        – ALSO show me patterns of feelings related to my products/services
    – How: real-time big data analytics
  • 20. Business Goals: From Data to Insight
    (Diagram labels: Twitter, etc.; Processing; Visualization and Reporting; Machine analysis; Interpret; Find Value & Execute; Human insight)
  • 21. Analytics Architecture: How to Process?
    (Diagram labels: Twitter, etc.; Processing; Visualization and Reporting; Machine analysis; Interpret; Find Value & Execute; Human insight)
    – Aggregation and Queuing: depends on whether you use MapR or other Hadoop distro (explain later)
    – Real-time processing: Apache Storm [1]
        – Established open source project for robust, distributed RT processing
  • 22. Analytics Architecture: How to Display?
    (Diagram labels: Twitter, etc.; Processing; Visualization and Reporting; Machine analysis; Interpret; Find Value & Execute; Human insight)
    – Visualization: many choices, e.g. D3.js, Tableau, Processing
    – Web server: many choices, e.g. node.js, Twisted Web etc.
  • 23. Analytics Architecture: End-to-End
    (Diagram labels: Twitter; Twitter API; TweetLogger; Catcher; Topic Queue; Storm; MapR; NFS; Web Data; Web-server; http)
  • 24. Aggregation and Queuing Layer Design
    – Apache Storm provides real-time processing framework
        – Record-oriented model
        – Function()s transform record streams into new record streams
        – Distributed, failure-tolerant, and scalable
        – No inherent state
    – MapR provides the real-time processing storage
        – Process records and emit values (optionally writing to the file system)
        – Records have to be acknowledged or else they will be retransmitted
        – Provides failure tolerance
  • 25. Real-time on Hadoop demo
  • 26. Demo for Real-time on Hadoop/MapR
    – What application does
        – Reads tweets as they happen
        – Remembers the top few words
        – Makes engaging pictures
    – Technical requirements
        – Handle restarts well
        – Be fault tolerant
    – Best practice design tip:
        – Keep it really simple
  • 27. [DEMO:cached] [DEMO:live]
  • 28. Hadoop on MapR No Longer Stands Apart
    (Diagram labels: Twitter; Twitter API; TweetLogger; MapR cluster; Apache Storm; D3 Visualization)
  • 29. Importance of a Real-time File System
    – This application design provides
        – A distributed, partitioned, multi-subscriber commit log
        – With replication and failure tolerance
    – This application design is easy to implement…
    – Because the hard problems are solved at the platform layer
        – No need for replication in the queuing layer
        – Failure tolerance is trivial, well-hardened in production
        – Performance even with replication is very, very high
    – But … Not all Hadoop distributions include a real-time file system
  • 30. Alternative Queuing Layer - Kafka
    – Apache Kafka
        – Provides distributed, partitioned, multi-subscriber commit log
        – As of 0.8 beta, also supports replication of data
        – Is well-tested. It is used extensively in production at high volumes
    – But … Kafka requires a separate cluster (not needed with MapR)
        – Data must be persisted in multiple clusters (Storm & Kafka)
        – Replication capability is new and not well-tested for mission-critical environments
        – Failure tolerance is implemented at the application layer. This means… This design does not generalize to other
  • 31. Without MapR: many clusters
    (Diagram labels: Twitter; Twitter API; Twitter Scraper; Flume Cluster; Hadoop Cluster; HDFS Data; Kafka API; Kafka Cluster; Storm; Report Data; Web Service; Web-server; http; NAS)
  • 32. Analytics Architecture: End-to-End
    (Diagram labels: Twitter; Twitter API; TweetLogger; Catcher; Topic Queue; Storm; MapR; NFS; Web Data; Web-server; http)
  • 33. Sentiment Analysis In Real-time
    – Business Goal: Who is having a bad experience with my brand and how can I fix it?
    – What does the result look like?
        – Show me now
        – On Twitter
        – Who it is…
        – …how they feel
        – …and what product/service they’re interacting with
        – ALSO show me patterns of feelings related to my products/services
        – ALSO allow retrospective analysis for R&D
    – How: real-time big data analytics
  • 34. MapR Data Platform Advantage
    (Diagram labels: Twitter; Twitter API; TweetLogger; MapR cluster; Apache Storm; D3 Visualization; R&D Batch Analytics; Other Applications; Such as . .)
  • 35. Data Warehouse / Hadoop ETL Offload
    (Diagram labels, with steps numbered ❶ to ❺: Billing Systems; Accounts Receivable; Revenue Accounting; Structured & Unstructured Data; MapR Distribution for Apache Hadoop: Extract, Clean, Transform, Conform; Structured Data; Teradata; Data Warehouse and Analytics; Financial Reporting; Other BI Front-ends)
  • 36. When Hadoop Looks Like a NAS…
    – Data ingestion is easy
        – Popular online gaming company changed data ingestion from a complex Flume cluster to a 17-line Python script
    – Database bulk import/export with standard vendor tools
        – Large telecom saved $30M on Teradata costs by pre-processing with MapR
    – 1000s of applications/tools
        – Large credit card company uses MapR volumes as the user home directories on the Hadoop gateway servers
    (Diagram labels: Logs; Application servers)
    Shell commands shown on the slide:
        $ find . | grep log
        $ cp
        $ vi results.csv
        $ scp
        $ tail -f part-00000
  • 37. Recap
    – Exciting times in Hadoop, lots of new and exciting capabilities
    – Exciting times in world of webservers, visualization and real-time
    – Integration of both worlds is easier than it looks at first
        – … if you have a real-time filesystem
    – MapR Data Platform is widely compatible
        – Use legacy code & applications without having to re-write
        – Access traditional data stores more easily
        – Connect to new technologies directly
  • 38. Thank You!!
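
Slide 29 describes the queuing layer as a partitioned, multi-subscriber commit log whose replication and failure tolerance come from the file system underneath it. The sketch below is one way to picture that design and is not the implementation shown in the talk: `FileCommitLog`, `Subscriber`, the `partition-N.log` file naming, and the CRC-based partitioning are hypothetical. It assumes only that the log directory sits on a file system that supports appends and reads of files still being written, such as an NFS mount of the cluster.

```python
import os
import zlib


class FileCommitLog:
    """Append-only log split into one plain file per partition (hypothetical sketch)."""

    def __init__(self, root, partitions=2):
        self.root = root
        self.partitions = partitions
        os.makedirs(root, exist_ok=True)

    def path(self, partition):
        return os.path.join(self.root, f"partition-{partition}.log")

    def append(self, key, record):
        # Records with the same key always land in the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % self.partitions
        line = record.replace("\n", " ") + "\n"
        with open(self.path(partition), "ab") as f:
            f.write(line.encode("utf-8"))
        return partition


class Subscriber:
    """Each subscriber keeps its own byte offsets, so subscribers never interfere."""

    def __init__(self, log):
        self.log = log
        self.offsets = {p: 0 for p in range(log.partitions)}

    def poll(self):
        records = []
        for p in range(self.log.partitions):
            path = self.log.path(p)
            if not os.path.exists(path):
                continue
            with open(path, "rb") as f:
                f.seek(self.offsets[p])
                data = f.read()
            # Consume only complete lines; a partially written record waits for the next poll.
            end = data.rfind(b"\n") + 1
            for line in data[:end].split(b"\n")[:-1]:
                records.append(line.decode("utf-8"))
            self.offsets[p] += end
        return records
```

The sketch contains no replication or failover logic at all, which is the point the slide makes: in this design those concerns are delegated to the storage platform rather than handled in the queuing code.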
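
Slide 36 mentions a company that replaced a Flume cluster with a 17-line Python script once the cluster could be mounted like a NAS. That script is not shown in the deck, so the following is only a guess at what such a script might look like: the function name `ingest`, the `.log` suffix, and the example paths are assumptions. The only thing it relies on is that the destination directory is an ordinary mounted path.

```python
import os
import shutil


def ingest(src_dir, dest_dir, suffix=".log"):
    """Copy new log files from src_dir into dest_dir and return the names copied."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dest = os.path.join(dest_dir, name)
        # Skip non-log entries, directories, and files that were already ingested.
        if not name.endswith(suffix) or not os.path.isfile(src) or os.path.exists(dest):
            continue
        # Copy under a temporary name, then rename, so readers never see a partial file.
        tmp = dest + ".part"
        shutil.copyfile(src, tmp)
        os.replace(tmp, dest)
        copied.append(name)
    return copied
```

A call such as `ingest("/var/log/app", "/mapr/my.cluster.com/logs/app")` would then be scheduled from cron on each application server; both paths here are placeholders, not values from the presentation.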