the challenge

the challenge

Dragan Milosevic - Bringing Together Millions of Publishers and Thousands of Advertisers

10mo ago
SOURCE  

Description

Bringing Together Millions of Publishers and Thousands of Advertisers by Zanox Reporting Systems built on top of Hadoop and Lucene Technologies Dr. Dragan Milosevic, Senior Hadoop Architect, dragan.milosevic@zanox.com Zanox is company that deals with the huge amount of data. Currently our tracking gets more than 2 million sales, 30 million clicks and almost 1 billion views every day. Huge amount of data are also coming from search-engines that are providing information about related costs. The challenge is to join and summarise those huge sets of data and provide valuable tracking and cost statistics to more than 1 million publishers and 10 thousands advertisers. They will use our analyses to successfully drive their business by knowing which launched campaigns are well performing and how others can be changed to bring even more money. This talk will have two parts. The first part shows how Hadoop helps in analysing available data and saving valuable results in Lucene indexes. The second part describes Lucene search infrastructure that is able to efficiently generate reports in real time for publishers and advertisers. Challenges to be solved during Hadoop processing can be summarised as follows: (1) Joining several huge sets of data, namely the data that are tracked by Zanox, costs data coming from search-engines and master-data about publishers and advertisers. We have developed unique approach that combines map-side and reduce-side joins, trying to combine the best of both techniques. It does sampling to identify records which have join-key that is frequent. Those records will be joined on a map-side and directly written to output without loading map-reduce pipeline. Only records whose join-key is infrequent will be propagated to reducers in order to be joined there. Provided experiments have shown significant gains compared to pure reduce-side join mainly due to the fact that more than 80% of records can be joined on a map-side, resulting that only one fifth of records have to be sorted and joined on a reduce-side. (2) Several aggregation jobs are found to be very important for us, mainly due to the huge number of records that have to be summarised. We have made many experiments aiming to speed-up map-reduce engine, being in our particular case slow mainly due to the fact that sorting on a reduce-side has to wait until all map-tasks are not completed. In the case where thousands of map-tasks have to be executed, reduce tasks are waiting a significant amount of time. Our solution is very simple, and it is based on replacing one huge job with several smaller ones that are responsible only for the selected part of input data. Because smaller jobs have much less map-tasks that have to be finished before reduce-tasks can start with expensive sorting and aggregation operations, we have managed to earlier activate resources that are responsible for reduce tasks and on average to speed-up complete aggregation for more than 30%. (3) Lucene indexes save every aggregated result produced by Hadoop jobs to make them real-time available to publishers and advertisers. Created indexes represent different views to aggregated data, being responsible to speed-up the processing of queries. We have optimised indexing performance by (a) building different aggregations of every record simultaneously, (b) using intelligent partitioners that are sending the semantically same aggregated-versions of records to same reducers, and (c) taking care of producing medium-size indexes that can be generated completely in memory on a reduce-side. Lucene search infrastructure is dealing with the real-time generation of reports on a request basis. Because the response time plays very important role while serving online requests, the architecture of our search-backend is optimised by following actions: (1) Indexing data with different aggregation granularities to optimise the execution of various types of queries. Ideally the requested info...