
  • 2. Plan
    1. Introduction and Motivation
    2. Problem Domain – Big Data for Android Devices
    3. Solution Domain
       1. Solution Architecture
       2. Programming Paradigms
       3. Map Reduce
       4. Design Patterns Used
       5. Implementation
    4. Summary and Discussion
  • 3. Introduction
    • Big Data is a programming paradigm that supports data-flow programming where traditional RDBMS and SQL-based systems fail.
      • SQL systems fail not only to scale up but also to provide the desired functionality.
      • For example, Big Data is the backbone of Facebook, Twitter, LinkedIn and Google.
    • The common data pattern for these companies is a huge amount of unstructured data.
  • 4. Unstructured Data vs. Structured Data
    • Most data is unstructured or semi-structured: think of tweets, likes, profile updates …
    • SQL data is structured, i.e. the types (mostly primitive) and the fields are known in advance, and there is little deviation from the flat-table norm.
  • 5. Android World
    • Android smartphones are getting smarter
    • They handle more and more data
    • Big data patterns are reaching the smartphone market
    • Current big data solutions such as Hadoop are not appropriate, because they solve the wrong problems:
      • Filesystem
      • Multi-machine load balancing
  • 6. Constraints for Android-based applications
    • The application sleeps most of the time and does not run
    • Fault-tolerant file systems are impossible to have
    • Save battery and CPU power:
      • Reuse of containers
      • Sharing resources – ashmem, string pool
  • 7. Problem
    • Single-threaded approaches to data handling (sort/search/analyze) are naïve:
      • They get slower as the data grows
      • Awkward to develop and maintain
      • No multi-core/threading utilization
  • 8. Problem Domain Definition
    • The Problem Domain is the world where the software product's requirements exist.
    • Technically speaking, it is a conceptual model which describes the:
      • Various entities
      • Attributes and relationships
      • Scope and constraints
  • 9. Problem Domain Case Study
    • Functionality needed for semi-structured, text-based data:
      • Word Counting
      • Inverted Index: a mapping between words and the documents where the words appear
      • Distributed Grep
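Of the three functionalities listed on slide 9, grep is the simplest to phrase as map-reduce: the map step filters the lines of one chunk, and the reduce step merely concatenates the per-chunk matches. The following is a minimal single-process sketch under that assumption; the class and method names (`GrepSketch`, `mapChunk`, `grep`) are hypothetical and not taken from Profiterole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not Profiterole code.
class GrepSketch {
    // Map step: keep only the lines of one chunk that contain a match.
    static List<String> mapChunk(List<String> chunk, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        for (String line : chunk) {
            if (pattern.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    // Reduce step is the identity: concatenate matches in chunk order.
    static List<String> grep(List<List<String>> chunks, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> all = new ArrayList<>();
        for (List<String> chunk : chunks) {
            all.addAll(mapChunk(chunk, pattern));
        }
        return all;
    }
}
```

Because each chunk is filtered independently, the calls to `mapChunk` could be handed to separate threads without changing the result, provided the per-chunk lists are concatenated in chunk order.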
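  • 10. Word Count Execution
    Stages: Input → Map → Shuffle & Sort → Reduce → Output
    • Input (one chunk per Map task):
      • the quick brown fox
      • the fox ate the mouse
      • how now brown cow
    • Map output:
      • chunk 1: the, 1; quick, 1; brown, 1; fox, 1
      • chunk 2: the, 1; fox, 1; ate, 1; the, 1; mouse, 1
      • chunk 3: how, 1; now, 1; brown, 1; cow, 1
    • Reduce output (after Shuffle & Sort):
      • reducer 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
      • reducer 2: ate, 1; cow, 1; mouse, 1; quick, 1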
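The word count execution on slide 10 can be sketched in a few lines of Java. This is a sequential illustration of the map, shuffle and reduce stages only, assuming Java 8 or later for `Map.merge`; the names `WordCountSketch`, `map`, `reduce` and `wordCount` are hypothetical, not Profiterole's API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Profiterole code.
class WordCountSketch {
    // Map step: emit one (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle & sort plus reduce: group pairs by word and sum the counts.
    // A TreeMap keeps the keys sorted, as in the slide's output column.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
                "the quick brown fox",
                "the fox ate the mouse",
                "how now brown cow")));
    }
}
```

For the three input lines from the slide, `wordCount` yields `the` = 3, `brown` = 2 and `fox` = 2, with every other word counted once.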
  • 11. Inverted Index Example
    • Input:
      • hamlet.txt: to be or not to be
      • 12th.txt: be not afraid of greatness
    • Map output:
      • hamlet.txt: to, hamlet.txt; be, hamlet.txt; or, hamlet.txt; not, hamlet.txt
      • 12th.txt: be, 12th.txt; not, 12th.txt; afraid, 12th.txt; of, 12th.txt; greatness, 12th.txt
    • Output:
      • afraid, (12th.txt)
      • be, (12th.txt, hamlet.txt)
      • greatness, (12th.txt)
      • not, (12th.txt, hamlet.txt)
      • of, (12th.txt)
      • or, (hamlet.txt)
      • to, (hamlet.txt)
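The inverted index from slide 11 can be sketched the same way: the map step emits (word, document) pairs and the reduce step collects the set of documents for each word. The sketch below fuses both steps into one loop for brevity and assumes Java 8 or later; `InvertedIndexSketch` and `build` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration, not Profiterole code.
class InvertedIndexSketch {
    // documents maps a file name to its text; the result maps each word
    // to the sorted set of file names in which the word appears.
    static Map<String, Set<String>> build(Map<String, String> documents) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String word : doc.getValue().toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // "Emit" (word, file name) and merge it into the word's set.
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("hamlet.txt", "to be or not to be");
        docs.put("12th.txt", "be not afraid of greatness");
        System.out.println(build(docs));
    }
}
```

Using a set per word means repeated occurrences inside one document, such as the two occurrences of "to" in hamlet.txt, collapse into a single entry.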
  • 12. Solution Domain
    • A conceptual model that covers the use cases of the Problem Domain and describes:
      • All entities and relationships related to the "implementation" of the problem
      • Analysis and Architectural Patterns
      • Design Patterns
    • The Solution Domain is larger than the Problem Domain, because it adds entities that are taken for granted in the Problem Domain (such as a factory).
  • 13. Welcome "Profiterole"
    • Open source Java/Android Big Data solution that implements the Map Reduce algorithm
    • Operates on large text files by breaking them into chunks
    • Template based – works not only for strings and integers but for any comparable objects
    • Fully concurrent
    • Optional Lua-based post-processing
  • 14. Example of Profiterole output
  • 15. High Level View
  • 16. Map Reduce
    • A framework for processing highly distributable problems across huge datasets using a large number of computers/threads/CPUs.
    • The framework contains both Map and Reduce functions.
    • The motivation is a large amount of input data combined with many available machines, which therefore need to be used effectively.
  • 17. Map-Reduce
    1. "Map" step: The master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
    2. "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, which is the answer to the problem it was originally trying to solve.
  • 18. Map-Reduce
    Very big data → MAP → REDUCE → Result
    • Map:
      • Accepts an input key/value pair
      • Emits intermediate key/value pairs
    • Reduce:
      • Accepts an intermediate key/value pair
      • Emits an output key/value pair
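The key/value contract on slide 18 can be written down as two small Java interfaces plus a driver that groups the intermediate pairs by key. This is a sequential sketch under stated assumptions: the names `MapReduceContract`, `Emitter`, `Mapper`, `Reducer` and `run` are hypothetical and say nothing about how Profiterole or Hadoop declare their types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the map/reduce contract.
class MapReduceContract {
    interface Emitter<K, V> { void emit(K key, V value); }

    // Map: accepts an input pair, emits intermediate pairs.
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Emitter<K2, V2> out);
    }

    // Reduce: accepts an intermediate key with all its values, emits output pairs.
    interface Reducer<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, Emitter<K3, V3> out);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input, Mapper<K1, V1, K2, V2> mapper, Reducer<K2, V2, K3, V3> reducer) {
        // Shuffle & sort: group intermediate values by key, keys kept sorted.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> e : input.entrySet()) {
            mapper.map(e.getKey(), e.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<K3, V3> output = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> g : grouped.entrySet()) {
            reducer.reduce(g.getKey(), g.getValue(), output::put);
        }
        return output;
    }

    // Usage example: word count expressed through the two interfaces.
    static Map<String, Integer> wordCount(Map<String, String> docs) {
        Mapper<String, String, String, Integer> mapper = (name, text, out) -> {
            for (String w : text.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) out.emit(w, 1);
            }
        };
        Reducer<String, Integer, String, Integer> reducer = (word, ones, out) -> {
            int sum = 0;
            for (int one : ones) sum += one;
            out.emit(word, sum);
        };
        return run(docs, mapper, reducer);
    }
}
```

Keeping the emitter as its own interface lets a mapper emit zero, one or many intermediate pairs per input, which is what distinguishes map-reduce from a plain element-wise transformation.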
  • 19. Design Paradigms
    • A working convention, a way to think about a program.
    • Goal of the paradigm: to think and arrive at a program.
  • 20. Programming Paradigms
    • A programming paradigm is a conceptual model for creating programs, supported by a programming language.
    • Paradigms differ in the concepts and abstractions used to:
      • Represent the elements of a program, such as objects, functions, variables, constraints, etc.
      • Represent the steps that compose a computation, such as assignment, evaluation, continuations, data flows, etc.
  • 21. Programming Paradigms in Profiterole
    • Map-Reduce problems are functional by nature: map and reduce are first-class citizens.
    • All development is done in Java, which is an object-oriented language
    • A few parts are generic
    • Results are table based
    • A tradeoff has to be found, for example between what can be solved by generics and what can be solved by inheritance.
  • 22. Design Patterns
    • Architectural solutions needed to solve problems in context
    • All in all, patterns are no more than structures for connecting classes. That is the mechanical definition; the real value of a pattern is that it is a structure or sub-part recognized immediately not only by whoever writes the code but also by whoever reads or uses it.
  • 23. Design Patterns
    • Command
    • Mediator
    • Strategy
    • Template Method
    • Observer
  • 24. Problem
    • Results of map reduce are very difficult to process
    • The result handling must be simple and generic to use
    • Solution: add another level of indirection
  • 25. Waffle Dataset
    • Batched data handling is a major component
    • Modeled as a hash table of keys and values
    • Takes ideas from Lua
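Slide 25 only states that Waffle is modeled as a hash table of keys and values. To make the "level of indirection" idea concrete, here is a hypothetical sketch of such a result table; the class `ResultTableSketch` and its methods are assumptions made for illustration and are not Waffle's actual API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration, not the Waffle implementation.
class ResultTableSketch<K extends Comparable<K>, V> {
    private final Map<K, V> rows = new HashMap<>();

    // Reject null keys and values up front so lookups never surprise callers.
    void put(K key, V value) {
        Objects.requireNonNull(key, "key");
        Objects.requireNonNull(value, "value");
        rows.put(key, value);
    }

    // The caller supplies the fallback, so a missing key never yields null.
    V getOrDefault(K key, V fallback) {
        return rows.getOrDefault(key, fallback);
    }

    // Batched access: all keys in sorted order, as a fresh list.
    List<K> sortedKeys() {
        List<K> keys = new ArrayList<>(rows.keySet());
        Collections.sort(keys);
        return keys;
    }

    int size() {
        return rows.size();
    }
}
```

Callers work against the table rather than against raw per-thread maps, so the storage behind it can change without touching user code.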
  • 26. Coding – SDK Structure
    • API – user-level APIs and callbacks
    • MapReduce – implementation
    • Samples – code samples with a sample implementation of the callbacks
    • Tests – all the development tests
    • Waffle – result set implementation
  • 27. SDK Logical View (by packages)
    • User Level: API, Samples
    • Core: Map Reduce Implementation, Tests
    • Result: Waffle Database
  • 28. Implementation
    • Android APIs
      • UI
      • Files from SD card
    • Java APIs
      • Concurrent libraries
      • Use of Java generics
  • 29. Key Code (async thread pool)
        MapCallback<TMapInput> mapper = new MapCallback<TMapInput>();
        List<HashMap<String, Integer>> maps = new LinkedList<HashMap<String, Integer>>();
        int numThreads = 25;
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        CompletionService<OutputUnit> futurePool =
                new ExecutorCompletionService<MapCallback.OutputUnit>(pool);
        Set<Future<OutputUnit>> futureSet = new HashSet<Future<OutputUnit>>();
        // linear addition of jobs, parallel execution
        for (TMapInput m : input) {
            futureSet.add(futurePool.submit(mapper.makeWorker(m)));
        }
        // submitted tasks keep running; shutdown() only rejects new submissions
        pool.shutdown();
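The fragment on slide 29 stops after the jobs are submitted. A self-contained sketch of the same thread pool pattern, including the step that collects results as tasks finish, might look as follows. It assumes Java 8 or later, and the names `ThreadPoolWordCount`, `countChunk` and `countAll` are hypothetical stand-ins for Profiterole's `MapCallback` machinery, which the slide does not show.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration, not Profiterole code.
class ThreadPoolWordCount {
    // Map task for one chunk: count the words it contains.
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> countAll(List<String> chunks, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        CompletionService<Map<String, Integer>> completion =
                new ExecutorCompletionService<>(pool);
        // Linear addition of jobs, parallel execution.
        for (String chunk : chunks) {
            Callable<Map<String, Integer>> task = () -> countChunk(chunk);
            completion.submit(task);
        }
        pool.shutdown(); // no new tasks; the submitted ones keep running

        // Reduce: merge partial results in whatever order the tasks finish.
        Map<String, Integer> total = new TreeMap<>();
        try {
            for (int i = 0; i < chunks.size(); i++) {
                Map<String, Integer> partial = completion.take().get();
                for (Map.Entry<String, Integer> e : partial.entrySet()) {
                    total.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for map tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a map task failed", e.getCause());
        }
        return total;
    }
}
```

`CompletionService.take()` blocks until the next task completes, so the merge loop starts reducing while slower chunks are still being mapped. Only the calling thread touches `total`, so it needs no synchronization.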
  • 30. Testing
    • A cornerstone component
    • Testing types:
      • Functionality
      • Nullity and parameters
    • Testing utilities such as sorting
  • 31. API Practices
    • API decisions:
      • Use referential transparency in APIs
      • API patterns from the Java collections
      • Use Java generics
  • 32. Java
    • Lingua franca of Android development
    • General-purpose, concurrent, class-based, object-oriented language
    • Android includes the major Java language libraries (io, net, util, lang)
    • Compiles to class format, is then transformed to dex format, and runs on the Dalvik virtual machine
  • 33. Lua Backend
    • Provides a REPL for working with results online
  • 34. Summary
    • Effective design decisions:
      • Very simple API
      • Never return nulls
      • Checks for validity
    • Runs fast on mega-size databases
  • 35. Interested?
    • http://code.google.com/p/profiterole/
    • http://code.google.com/p/profiterole/downloads/list
  • 36. Thank you! boris.farber@gmail.com