20131118 - Seoul - Advanced Computing Conference 10 - New Trends In Hadoop

Published on

http://www.zdnet.co.kr/news/news_view.asp?artice_id=20131119145440

Published in: Technology, Education
  • Add your contact information including “MapR Job Title” starting with the word “MapR” such as MapR Solutions Architect etc. Add the #hashtag of the particular meet-up and choose one or more of the others BUT NOT ALL OF THEM
  • Reduce cost and time
  • This slide and the following demos set the stage to introduce the overall big idea of the talk: that there are many cool advances in these different areas, but without the ideas introduced here, it’s hard to connect these technologies together easily and reliably
  • Gives up random access read on files. Gives up strong authentication / authorization model. Gives up random access write / append on files.
  • http://trends.truliablog.com/vis/metro-movers/
  • Main point: these applications can be served directly from the MapR filesystem because it presents as a POSIX filesystem using the NFS protocol. Non-big-data applications like Node.js and D3.js can work directly off of MapR, and don’t care that it is also a highly scalable DFS
  • Talk track: Let’s look at another example of new technologies on Hadoop in more detail. Real-time analysis has many uses, such as sentiment analysis. What is sentiment analysis and why would you use it? (then start into slide details)
  • Talk track: This diagram shows what you are doing at each step in the process. Of course the final steps are interpretation of the output of your application by humans to gain insight that will power business decisions.
  • Allen: This is the MapR view. You can do it first or the non-MapR view first. I think you said you preferred to do the non-MapR view first?
  • Note to speaker: listen to explanation at about 23min into Ted’s video for this slide and next slide showing the sliding window
  • Note to speaker: here is the option if you don’t have a distributed system with a real-time replicated filesystem (such as what MapR has). The alternative is to use something like Kafka. Much harder to build etc.
  • “redundant redundancy” is about too many data copies = inefficient design and risk of inconsistency between systems
  • Allen: Is this useful? Is it MapR specific?
  • What is NAS? Should the little elephant say MapR? Remember this slide was from a sales pitch, so you may have to make clear it’s talking about MapR
  • Allen: Do we want to use the images including Korean bowl or better to avoid possible negative reaction and just use a text slide for transition to thinking about legacy applications?
  • Another example of the same thing (this slide can be suppressed – just another view of the inaccuracies)
  • I hid this one and went with the other format of your slide.
  • Again, I’ve hidden this version and gone with the one that is formatted differently. Same content, though.
  • Talk track: In this presentation, we are going to look in more detail at the tools and technologies used for this middle stage, which is done by machine: the processing and visualization/reporting steps… (Allen: don’t know if you want to say more details here, but I suspect this slide is just a transition to get the audience focused on the processing & the viz/reporting part of the work flow.)
  • Note to speaker: Slide just sets up the transition from business goals to the architecture diagram slide. Don’t need a lot of detail, but introduce Storm, say a little about the project & that it is new to Apache, AND MENTION THAT MAPR’s Ted Dunning is one of the PROJECT MENTORS. Talk track: Now that you see what the business goal of using real-time sentiment analysis is, let’s look at the architecture for a sample project… Here are some technologies you’ll need… [explain] and then say “and on the next slide we see a diagram of the architectural design” (or call it work flow?)
  • Allen: This is the MapR view. You can do it first or the non-MapR view first. DOES THE WEB SERVER/WEB DATA ALSO COME OFF if not a MapR cluster?
  • The distinction is not clear to me… Attempt 2 meaning different slide to show same idea? Confusing to mix symbols of actions/ ideas with components in work flow
  • This is my start on the non-MapR view
  • Do you want to include this information?
  • MapR enables integration by providing industry-standard interfaces. More 3rd party solutions work with MapR than any other distribution. Proprietary connectors not needed. NFS: all file-based applications can read and write data (examples: Linux utilities, file browsers, Informatica UltraMessaging). ODBC 3.52: all BI applications can leverage Hive (examples: Excel, Crystal Reports, Tableau, MicroStrategy). Linux PAM: any authentication provider can be used (examples: LDAP, Kerberos, 3rd party).
  • Note from Ted’s video hints: HBase storage here can be handy but is much more of a sideline to the main idea.
  • Transcript of "20131118 - Seoul - Advanced Computing Conference 10 - New Trends In Hadoop"

    1. Making Hadoop Work for Everybody ©MapR Technologies - Confidential
    2. Making Hadoop Work for Everybody
       • Allen Day, MapR Technologies, Principal Data Scientist
       • Contact:
         – Email: allenday@maprtech.com
         – Twitter: @allenday
       • Slides:
         – http://slideshare.net/allenday
       • Hash tags: #
    3. Hadoop adoption is widespread. What happens next?
    4. Big Data Trends: Where Are We Going?
       • Big Data Storage → Big Data Applications: don’t just store data but mine it to extract the full benefits.
         – Real-time processing requirements make low latency important
         – Many exciting new technologies are available
       • Going from academic → practical: take machine learning & advanced analytics from the research lab into business environments.
       • “Simple algorithms and lots of data trump complex models” [1]: elaborate designs may not give the best business benefits in production. Simplicity is the key to success.
       • Re-usability shortens time-to-market: use pre-existing components and familiar architectural design patterns to reduce development cost and time
       [1] Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems
    5. Big Data Trends: Who is Driving?
       • Web camp
         – everything is a service with a URL or a DOM
       • Big data camp
         – non-traditional scalable file systems
       • Everybody else
         – files and databases
       But… they don’t easily work together
    6. This is not a problem. It’s an opportunity
    7. New Technologies: Can They Play Together?
       Examples of excellent modern technologies
       • d3.js [1] does real-time, interactive visualization for excellent images of data
       • node.js [2] allows simple (not just web) servers
       • Apache Storm [3] does real-time processing
       • Hadoop does big data distributed storage really well
       But HDFS makes Hadoop stand somewhat alone
       • Special steps are needed to ingest and access data on a Hadoop cluster
       MapR has changed that . . .
       [1] http://d3js.org [2] http://nodejs.org [3] http://incubator.apache.org/storm
    8. Evolution of Data Storage
       Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
       (Chart. Axis labels: Scalability, Functionality, Compatibility. Plotted: Linux POSIX.)
    9. Evolution of Data Storage
       Hadoop achieves much higher scalability by trading away essentially all of this compatibility
       (Chart. Axis labels: Scalability, Functionality, Compatibility. Plotted: Hadoop, Linux POSIX.)
    10. Evolution of Data Storage
        MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance
        (Chart. Axis labels: Scalability, Functionality, Compatibility. Plotted: Hadoop, Linux POSIX.)
    11. MapR Data Storage: How it’s done
        (Diagram. Components: HBase NoSQL Tables API, POSIX NFS, Hadoop HDFS API, Apache HBase, Apache Hadoop HDFS, MapR Filesystem. Arrow labels: implements, depends.)
    12. MapR Data Storage: How it’s done
        Vertical Integration = High Performance
        (Diagram. Components: HBase NoSQL Tables API, POSIX NFS, Hadoop HDFS API, Apache HBase, Apache Hadoop HDFS, MapR Filesystem. Arrow labels: implements, depends.)
    13. Hadoop on MapR No Longer Stands Apart
        (Diagram. Around the MapR cluster: Legacy code & applications; New technologies (d3, node.js, Apache Storm); Multiple types of data sources; New custom applications.)
    14. What does this compatibility mean for you?
    15. Example: visualization of big data
    16. Visualization Gives Data Impact
        New technologies include visualization tools like D3.js [1] and Node.js [2]
        • POSIX tools that run and scale easily on MapR
        • http://trends.truliablog.com/vis/metro-movers/
        [1] http://d3js.org/ [2] http://nodejs.org
    17. Visualization Gives Data Impact (same text as slide 16)
    18. Example: real-time on Hadoop
    19. Sentiment Analysis In Real-time
        • Business Goal: Who is having a bad experience with my brand and how can I fix it?
        • What does the result look like?
          – Show me now
          – On Twitter
          – Who it is…
          – …how they feel
          – …and what product/service they’re interacting with
          – ALSO show me patterns of feelings related to my products/services
        • How: real-time big data analytics
    20. Business Goals: From Data to Insight
        (Diagram. Data sources (Twitter, etc.) → Processing → Visualization and Reporting → Interpret → Find Value & Execute. Processing and Visualization and Reporting are labeled “Machine analysis”; Interpret and Find Value & Execute are labeled “Human insight”.)
    21. Analytics Architecture: How to Process?
        (Same data-to-insight diagram as slide 20.)
        • Aggregation and Queuing: depends on whether you use MapR or other Hadoop distro (explain later)
        • Real-time processing: Apache Storm [1]
          – Established open source project for robust, distributed RT processing
    22. Analytics Architecture: How to Display?
        (Same data-to-insight diagram as slide 20.)
        • Visualization: many choices, e.g. D3.js, Tableau, Processing
        • Web server: many choices, e.g. node.js, Twisted Web etc.
    23. Analytics Architecture: End-to-End
        (Diagram. Components: Twitter, Twitter API, TweetLogger, MapR, Catcher, Topic Queue, Storm, Web Data, Web-server. Connection labels: http, NFS.)
    24. Aggregation and Queuing Layer Design
        • Apache Storm provides real-time processing framework
          – Record-oriented model
          – Function()s transform record streams into new record streams
          – Distributed, failure-tolerant, and scalable
          – No inherent state
        • MapR provides the real-time processing storage
          – Process records and emit values (optionally writing to the file system)
          – Records have to be acknowledged or else they will be retransmitted
          – Provides failure tolerance
    25. Real-time on Hadoop demo
    26. Demo for Real-time on Hadoop/MapR
        • What application does
          – Reads tweets as they happen
          – Remembers the top few words
          – Makes engaging pictures
        • Technical requirements
          – Handle restarts well
          – Be fault tolerant
        • Best practice design tip:
          – Keep it really simple
    27. [DEMO:cached] [DEMO:live]
    28. Hadoop on MapR No Longer Stands Apart
        (Diagram. Components: Twitter, Twitter API, TweetLogger, MapR cluster, Apache Storm, D3 Visualization.)
    29. Importance of a Real-time File System
        • This application design provides
          – A distributed, partitioned, multi-subscriber commit log
          – With replication and failure tolerance
        • This application design is easy to implement…
        • Because the hard problems are solved at the platform layer
          – No need for replication in the queuing layer
          – Failure tolerance is trivial, well-hardened in production
          – Performance even with replication is very, very high
        • But … Not all Hadoop distributions include a real-time file system
    30. Alternative Queuing Layer - Kafka
        • Apache Kafka
          – Provides distributed, partitioned, multi-subscriber commit log
          – As of 0.8 beta, also supports replication of data
          – Is well-tested. It is used extensively in production at high volumes
        • But … Kafka requires a separate cluster (not needed with MapR)
          – Data must be persisted in multiple clusters (Storm & Kafka)
          – Replication capability is new and not well-tested for mission-critical environments
          – Failure tolerance is implemented at the application layer. This means… This design does not generalize to other
    31. Without MapR: many clusters
        (Diagram. Components: Twitter, Twitter API, Twitter Scraper, Flume Cluster, Hadoop Cluster, HDFS Data, Kafka API, Kafka Cluster, Storm, Report Data, Web-server, Web Service, NAS. Connection label: http. Several components are marked with “*”.)
    32. Analytics Architecture: End-to-End
        (Diagram. Components: Twitter, Twitter API, TweetLogger, MapR, Catcher, Topic Queue, Storm, Web Data, Web-server. Connection labels: http, NFS. One component is marked with “*”.)
    33. Sentiment Analysis In Real-time
        • Business Goal: Who is having a bad experience with my brand and how can I fix it?
        • What does the result look like?
          – Show me now
          – On Twitter
          – Who it is…
          – …how they feel
          – …and what product/service they’re interacting with
          – ALSO show me patterns of feelings related to my products/services
          – ALSO allow retrospective analysis for R&D
        • How: real-time big data analytics
    34. MapR Data Platform Advantage
        (Diagram. Components: Twitter, Twitter API, TweetLogger, MapR cluster, Apache Storm, D3 Visualization, R&D Batch Analytics, Other Applications.)
        Such as . . .
    35. Data Warehouse / Hadoop ETL Offload
        (Diagram with steps ❶ to ❺. Sources: Billing Systems, Accounts Receivable (Structured & Unstructured Data). Stages on the MapR Distribution for Apache Hadoop (nodes N1 …): Extract, Clean, Transform, Conform. Structured Data flows to Teradata (Data Warehouse and Analytics). Front-ends: Financial Reporting, Revenue Accounting, Other BI Front-ends.)
    36. When Hadoop Looks Like a NAS…
        • Data ingestion is easy
          – Popular online gaming company changed data ingestion from a complex Flume cluster to a 17-line Python script
        • Database bulk import/export with standard vendor tools
          – Large telecom saved $30M on Teradata costs by pre-processing with MapR
        • 1000s of applications/tools
          – Large credit card company uses MapR volumes as the user home directories on the Hadoop gateway servers
        (Diagram labels: Logs, Application servers.)
        Shell commands shown on the slide:
          $ find . | grep log
          $ cp
          $ vi results.csv
          $ scp
          $ tail -f part-00000
    37. Recap
        • Exciting times in Hadoop, lots of new and exciting capabilities
        • Exciting times in world of webservers, visualization and real-time
        • Integration of both worlds is easier than it looks at first
          – … if you have a real-time filesystem
        • MapR Data Platform is widely compatible
          – Use legacy code & applications without having to re-write
          – Access traditional data stores more easily
          – Connect to new technologies directly
    38. Thank You!!

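Slides 26 and 29 describe a demo that reads tweets as they happen, remembers the top few words, and handles restarts well, with the file system acting as the commit log. A rough sketch of that pattern, under stated assumptions: the log is one tweet per line in an ordinary file on a shared mount, and `append_tweet`, `TopWords`, and both file paths are hypothetical names, not taken from the demo's code.

```python
import json
import os
import re
from collections import Counter


def append_tweet(log_path, text):
    """Producer side: append one tweet per line to the shared log file."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(text.replace("\n", " ") + "\n")


class TopWords:
    """Consumer side: read newly appended lines and keep word counts.

    The byte offset and the counts are checkpointed to a small JSON file,
    so a restarted consumer resumes where the previous one stopped.
    """

    def __init__(self, log_path, state_path):
        self.log_path = log_path
        self.state_path = state_path
        self.offset = 0
        self.counts = Counter()
        if os.path.exists(state_path):
            with open(state_path, encoding="utf-8") as f:
                state = json.load(f)
            self.offset = state["offset"]
            self.counts = Counter(state["counts"])

    def poll(self):
        """Consume the lines appended since the last checkpoint."""
        with open(self.log_path, "rb") as log:
            log.seek(self.offset)
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # partial line still being written; pick it up on the next poll
                words = re.findall(r"[a-z']+", raw.decode("utf-8").lower())
                self.counts.update(words)
                self.offset += len(raw)
        # Write the checkpoint under a temporary name, then rename it into place.
        tmp_path = self.state_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump({"offset": self.offset, "counts": self.counts}, f)
        os.replace(tmp_path, self.state_path)

    def top(self, n=3):
        """Return the n most frequent words with their counts."""
        return self.counts.most_common(n)
```

Because each consumer keeps its own offset, several independent readers can follow the same file, which is the multi-subscriber property slide 29 refers to; replication and failure tolerance are left to the file system, as the slide argues.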
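
Slide 36 mentions replacing a Flume cluster with a short Python script once the cluster is reachable as an NFS mount. The script itself is not shown in the deck, so the following is only a guess at the general shape of such a script: `ingest_logs` and its parameters are hypothetical, and the target directory stands for wherever the cluster happens to be mounted.

```python
import shutil
from pathlib import Path


def ingest_logs(source_dir, cluster_dir, pattern="*.log"):
    """Copy matching log files into a directory on the NFS-mounted cluster.

    Returns the names of the files copied on this run.
    """
    source = Path(source_dir)
    target = Path(cluster_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.glob(pattern)):
        dest = target / path.name
        if dest.exists() and dest.stat().st_size == path.stat().st_size:
            continue  # already ingested and unchanged in size
        tmp = dest.with_name(dest.name + ".part")
        shutil.copyfile(path, tmp)  # write under a temporary name first
        tmp.replace(dest)           # rename so readers never see a half-copied file
        copied.append(path.name)
    return copied
```

Called as `ingest_logs("/var/log/myapp", "/mnt/cluster/ingest/myapp")` (both paths are placeholders), it uses nothing beyond ordinary file operations, which is the point the slide makes about Hadoop looking like a NAS.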