archive-info.com » INFO » A » ALBERTON.INFO

Total: 86

  • Lorenzo Alberton - Articles
    Tags: MySQL, NoSQL, Oracle, PDO, PEAR, Performance, PHP, PostgreSQL, Profiling, Scalability, Security, SPL, SQL Server, SQLite, Testing, Tutorial, TYPO3, Windows, Zend Framework.
    Book review: "TYPO3 Extension Development" by Dmitry Dulepov (PHP, PEAR, TYPO3, Book Review; 8 January 2009). A review of the book "TYPO3 Extension Development" by Dmitry Dulepov, Packt Publishing. Lorenzo has been working with large enterprise UK companies for the past…

    Original URL path: http://www.alberton.info/articles/filter/tag/TYPO3 (2016-04-23)

  • Lorenzo Alberton - Articles
    "PHP PDO Firebird status" (PHP, PEAR, Database, Firebird, SQL, Windows, PDO; 11 April 2006): a quick roundup of the PHP PDO Firebird driver, a test of its current status, what works and what doesn't. "HowTo: install Firebird/Interbase with PHP on Windows, a step-by-step tutorial" (PHP, PEAR, Database, Firebird, SQL, Windows; 30 March 2006): a step-by-step tutorial to install Firebird SQL…

    Original URL path: http://www.alberton.info/articles/filter/tag/Windows (2016-04-23)

  • Lorenzo Alberton - Articles
    Summary of the PHPNW10 conference, slides for my talk "Profile your PHP application and make it fly", and a new job. "Zend Framework, mod_rewrite and public dir in shared hosting" (PHP, Zend Framework, mod_rewrite; 15 February 2009): a quick tip on how to deal with the Zend Framework directory structure and the public document root directory in most shared hosting accounts, using mod_rewrite and an .htaccess file…
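
    The usual shape of the trick described in that tip is sketched below. This is an assumed layout, not the article's actual snippet: the shared-hosting web root is taken to contain the Zend Framework project, with its public/ directory one level down.

```apache
# .htaccess in the shared-hosting web root (hypothetical layout):
# silently forward every request into the Zend Framework public/ dir,
# so index.php and assets resolve as if public/ were the document root.
RewriteEngine On
RewriteRule ^(.*)$ public/$1 [L]
```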

    Original URL path: http://www.alberton.info/articles/filter/tag/Zend%20Framework (2016-04-23)

  • Lorenzo Alberton - Projects
    …both a Producer and a Consumer.
    LibSVM2 (http://github.com/quipo/LibSVM): experimental rewrite of libSVM and LibSVM Plus.
    Politecnico di Torino, Natural Language Processing (http://www.corep.it): developed several Automatic Text Classifiers, with focus on opinion mining and sentiment analysis, and an Information Extraction system, for an R&D job with a fund granted by the Research Consortium of Turin Polytechnic. Field: machine learning. Languages: C, C++, Java.
    Seagull PHP Framework (http://seagullproject.org): Seagull is a mature OOP framework/CMS for building web, command-line and GUI applications.
    WACT, Web Application Component Toolkit (http://www.phpwact.org): the Web Application Component Toolkit is a PHP framework for creating web applications. WACT facilitates a modular approach, where individual, independent or reusable components may be integrated into a larger web application. WACT assists in implementing the Model View Controller pattern and the related Domain Model, Template View, Front Controller and Application Controller patterns.
    PEAR, PHP Extension and Application Repository (http://pear.php.net): PEAR is a structured library of high-quality open-source code for PHP users. PEAR's mission is to provide reusable components, lead innovation in PHP…

    Original URL path: http://www.alberton.info/projects/ (2016-04-23)

  • Lorenzo Alberton - Articles - On batching vs. latency, and jobqueue models
    …removed from the CPU cache at least twice; the contention on the data provider is not removed, alas, merely moved downstream to another queue.
    Remark 2: "Batching increases latency. Now that many workers have direct and concurrent access to the source, to avoid hitting it too often each worker must request a larger batch, but that means introducing a higher latency; it's better to increase the number of workers and let them process fewer items each, in parallel." This is a sensible objection, and yet why can it be wrong under many circumstances? Two reasons: unless the work is I/O bound, having more workers than CPU cores only increases contention and context switching in the OS scheduler (a single process/thread per CPU core is the best configuration to maximise CPU efficiency); and batching improves throughput and actually decreases latency. Explaining this requires a little theory. There are two kinds of batching: you can wait for the batch to fill up, or a timeout to occur, and then ship it (example: Nagle's algorithm); the performance of this approach is not much better than serial delivery, and it adds latency. Or you can ship the first item that's available, process it, then come back and take all the data that was produced in the meantime, and keep looping that way. Applying Little's Law to an example: if you need to send 10 messages to a device with 100 µs latency, a serial delivery would take 1 ms, whilst (assuming it ships one item first and then the other 9 in another batch) batching strategy 2 would have an average latency of at most 190 µs; although on average 5 items will be available for each delivery, resulting in an average latency of only 150 µs. Batching can decrease latency. I do, however, concede that the original statement might be true if (a) the task is NOT CPU bound, and (b) after fetching a batch, the items are processed serially within a single worker, when another worker might have been idle and able to process some items on an idle CPU core.
    Remark 3: "Prefetching a large batch of items speeds up work distribution." The main idea behind the original job queue was to minimise the number of requests to the data provider by fetching data in large chunks, and have each worker consume a smaller batch from a fast memory buffer. Now, ignoring the fact that modern data sources can happily sustain 100K connections per second (making the need for one single collector go away), this remark is based on the assumption that the internal queue is always half full and the Manager process is actually reading large chunks all the time. Reality is: queues are, on average, either completely full or empty. In fact, it's very difficult to have a perfect, ideal balance between production and consumption rates. You'd rather hope consumption always outpaces production, or you step into the nasty business of unbounded queues. So…
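
    The Little's Law arithmetic in this excerpt can be checked with a short model. The sketch below is illustrative Python, not from the original article; the figures (10 messages, 100 µs per delivery) are the article's own.

```python
DELIVERY_US = 100  # one delivery, of any size, takes 100 microseconds
N = 10             # messages to send

# Serial delivery: 10 one-message deliveries, back to back.
serial_total_us = N * DELIVERY_US  # the last message arrives at 1 ms

# Batching strategy 2, worst case: the first message ships alone and
# arrives at 100 µs; the other 9 are batched while it is in flight
# and arrive together at 200 µs.
worst_case_avg_us = (1 * 100 + 9 * 200) / N  # 190 µs

# Average case: items spread evenly, 5 per delivery, arriving at
# 100 µs and 200 µs respectively.
average_avg_us = (5 * 100 + 5 * 200) / N  # 150 µs

print(serial_total_us, worst_case_avg_us, average_avg_us)
```

    So even in the worst case, batching strategy 2 beats the 1 ms serial total on average latency.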

    Original URL path: http://www.alberton.info/batching_vs_latency_and_jobqueue_models.html (2016-04-23)

  • Lorenzo Alberton - Articles - Updated Kafka PHP client library
    …compressed messages; completely refactored socket handling to be more robust, with better error checking and handling of edge cases; added support for 64-bit offsets; better checks for responses from Kafka; fixed connection hanging; added Zookeeper-based consumer, with support for multiple consumer groups and for manual offset commit action (so it's possible to wait for an ACK from the message processor before advancing the offsets), and example code; added support for OffsetRequest and getOffsetsBefore in the SimpleConsumer class, to query the state of the queue and recover consumption after old offsets become invalid because of expired data; support for connection timeouts in microseconds; vastly improved test suite. Overall this is a pretty solid release: we've been using it in production for over a month without any problem. Get it while it's hot! To use the library, you can either apply this patch to the official repository or check out the code from this github repo. Update: check out the code from the official Kafka PHP git repository.

    Example code for Producer:

      <?php
      $host  = 'localhost';
      $port  = 9092;
      $topic = 'test';
      $producer = new Kafka_Producer($host, $port, Kafka_Encoder::COMPRESSION_NONE);
      $in = fopen('php://stdin', 'r');
      while (true) {
          echo "\nEnter comma-separated messages:\n";
          $messages = explode(',', fgets($in));
          $bytes = $producer->send($messages, $topic);
          printf("\nSuccessfully sent %d messages (%d bytes)\n\n", count($messages), $bytes);
      }

    Example code for Simple Consumer:

      <?php
      $host          = 'localhost';
      $port          = 9092;    // kafka server
      $topic         = 'test';
      $maxSize       = 1000000;
      $socketTimeout = 5;
      $offset        = 0;
      $partition     = 0;
      $consumer = new Kafka_SimpleConsumer($host, $port, $socketTimeout, $maxSize);
      while (true) {
          // create a fetch request for topic "test", partition 0, current offset and fetch size of 1MB
          $fetchRequest = new Kafka_FetchRequest($topic, $partition, $offset, $maxSize);
          // get the message set from the consumer and print them out
          $partialOffset = 0;
          $messages = $consumer->fetch($fetchRequest);
          foreach ($messages as $msg) {
              echo "\nconsumed[" . ($offset + $partialOffset) . "]: " . $msg->payload();
              $partialOffset = $messages->validBytes();
          }
          // advance the offset after consuming each message
          $offset += $messages->validBytes();
          unset($fetchRequest);
      }

    Example code for Zookeeper-based Consumer:

      <?php
      // zookeeper address (one or more, separated by commas)
      $zkaddress = 'localhost:8121';
      // kafka topic to consume from
      $topic = 'testtopic';
      // kafka consumer group
      $group = 'testgroup';
      // socket buffer size: must be greater than the largest message in the queue
      $socketBufferSize = 10485760; // 10 MB
      // approximate max number of bytes to get in a batch
      $maxBatchSize = 20971520; // 20 MB
      $zookeeper  = new Zookeeper($zkaddress);
      $zkconsumer = new Kafka_ZookeeperConsumer(
          new Kafka_Registry_Topic($zookeeper),
          new Kafka_Registry_Broker($zookeeper),
          new Kafka_Registry_Offset($zookeeper, $group),
          $topic,
          $socketBufferSize
      );
      $messages = array();
      try {
          foreach ($zkconsumer as $message) {
              // either process each message one by one, or collect them and process them in batches
              $messages[] = $message;
              if ($zkconsumer->getReadBytes() >= $maxBatchSize) {
                  break;
              }
          }
      } catch (Kafka_Exception_OffsetOutOfRange $exception) {
          // if we haven't received any messages, resync the offsets for the next time, then bomb out
          if ($zkconsumer->getReadBytes() == 0) {
              $zkconsumer->resyncOffsets();
              die($exception->getMessage());
          }
          // if we did receive some messages before the exception, carry on
      } catch (Kafka_Exception_Socket_Connection $exception) {
          // deal with it below
      } catch (Kafka_Exception $exception) {
          // deal with it below
      }
      if…

    Original URL path: http://www.alberton.info/kafka_07_php_client_library.html (2016-04-23)

  • Lorenzo Alberton - Articles - Musings on some technical papers I read this weekend: Google Dremel, NoSQL comparison, Gossip Protocols
    …so much for linear scaling. In VoltDB's defence, synchronous queries and scans as done in the YCSB client are quite inefficient (as they are for MySQL: see here and here). As the paper notes, HBase is really difficult to configure properly, and most of the apparently poor results are likely due to the default HBase client, which we also found to be a real bottleneck. I really appreciated their mention of disk usage in the different stores given the same amount of input data; it's in line with our own experience at DataSift. When dealing with lots of data, it might be a useful piece of information to consider for capacity planning and hardware provisioning. One note about disk usage and data compression in the paper startled me a bit: they say that obviously disk usage can be reduced by using compression, which however will decrease throughput, and thus it's not used in their test. This might be true for their specific use case, but it's plain wrong when the size and amount of data grow; in fact, compression usually improves throughput (less data to scan, more records fit in a memory page). Finally, it would have been nice to mention bandwidth usage between the nodes in the cluster, as this bit us a few times in the past, and was important enough for us to invest in serious Arista networking gear.
    Dremel: Interactive Analysis of Web-Scale Datasets. This is a paper I read more than once, since it might soon become seminal, like the BigTable one. Dremel is a scalable query system for read-only nested data, used at Google since 2006 and offered as a web service called BigQuery. The paper has a few interesting take-aways. The first big win in query speed (if you're only interested in a few fields in a record) is obtained by moving from by-record storage to columnar storage. To make a parallel with the RDBMS world, this is exactly the reason why indexes exist: if you don't have to do a full table scan, but only need to evaluate a single field, expect huge performance improvements. Also, columnar formats compress very well, thus leading to less I/O and memory usage. Of course this is nothing new: many commercial and open-source implementations exist with column-oriented storage layout capabilities, like Vertica, Infobright and, to a certain extent, HBase and Cassandra. The difference is Dremel's capability of handling nested data models, which explains the novel data representation. And the data representation is probably the first hard bit to grasp, as it's not intuitive; I'd suggest carefully reading the explanation and the algorithm in the Appendix. What I really like about Dremel is the explicit trade-offs made to obtain the incredible query speed at scale: a limited query capability (no JOINs) to optimise latency (then again, they have Sawzall for complex queries). While not as expressive as SQL or MapReduce, Dremel really shows its power…
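
    The columnar-storage point can be made concrete with a toy sketch (illustrative Python, not from the paper): when each field lives in its own array, evaluating a predicate on one field touches only that array, instead of deserialising whole records.

```python
# 1000 toy records, three fields each.
rows = [{"id": i, "name": "user%d" % i, "score": i * 2} for i in range(1000)]

# Row-oriented scan: every whole record is touched to test one field.
fields_touched_row = sum(len(r) for r in rows)  # 3000 field accesses

# Column-oriented layout: pull the "score" field into its own array.
score_column = [r["score"] for r in rows]

# Columnar scan: only the 1000 values of "score" are read.
fields_touched_col = len(score_column)

# Both layouts return the same answer, with a third of the data touched.
matches_row = sum(1 for r in rows if r["score"] > 1000)
matches_col = sum(1 for s in score_column if s > 1000)
assert matches_row == matches_col
print(fields_touched_row, fields_touched_col, matches_col)
```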

    Original URL path: http://www.alberton.info/musings_on_technical_papers_dremel_nosql_gossip.html (2016-04-23)

  • Lorenzo Alberton - Articles - Historical Twitter access - A journey into optimising Hadoop jobs
    …as a consequence. A better approach (chunking strategy 2) is to divide the timeline into slots of equal, predetermined size, and then to partition the queries according to the slot boundaries they fall into. Of course the overlap of chunks on a slot thus determined is not always perfect, but it's easy to skip the records outside the chunk range when the query is executed, whilst still enjoying the benefits of an easier partitioning strategy and an upper bound on the amount of work for each chunk. Let me explain the two strategies with a diagram.

    Chunking strategy 1

    Table 1 - Pending queries
    Query | Time of arrival | From | To
    Q1    | t               | A    | D
    Q2    | t + 30m         | A    | C
    Q3    | t + 45m         | B    | E

    Suppose, for simplicity, that we can only run one M/R job at the same time, and that it takes 1 hour to scan a single segment of the archive (e.g. the segment AB or BC). If at time t we receive the query Q1 and the job tracker is free, Q1 is immediately executed as a single job, and it will take 3 hours to complete (AB + BC + CD). At time t + 30m we receive Q2, and at time t + 45m we receive Q3, but both remain in the pending queue until the first job is done. At time t + 3h we can start processing the segment BC for Q2 and Q3 together; at time t + 4h we can run the segment AB for query Q2; at time t + 5h we can run the segment CE for query Q3. The total process requires 7 hours.

    Table 2 - Chunking strategy 1
    Time   | Pending queue    | Chosen for execution | Execution time
    t      | Q1 (AD)          | Q1 (AD)              | 3 hours
    t + 3h | Q2 (AC), Q3 (BE) | Q2 + Q3 (BC)         | 1 hour
    t + 4h | Q2 (AB), Q3 (CE) | Q2 (AB)              | 1 hour
    t + 5h | Q3 (CE)          | Q3 (CE)              | 2 hours

    Chunking strategy 2

    If now we decide to chunk every query into slots of equal size (even if there's no overlap at this time), here's what happens with the same queries as above. At time t only query Q1 is available; its segment AB is started. At time t + 1h we can run the segment BC for all three queries. At time t + 2h we can run segment CD for queries Q1 and Q3. At time t + 3h we run segment AB for query Q2, and at time t + 4h we run the last segment DE for query Q3. The total process now only requires 5 hours, thanks to the more dynamic scheduling and to unconditional chunking into slots of predictable execution time.

    Table 3 - Chunking strategy 2
    Time   | Pending queue                               | Chosen for execution | Execution time
    t      | AB (Q1), BC (Q1), CD (Q1)                   | AB (Q1)              | 1 hour
    t + 1h | AB (Q2), BC (Q1, Q2, Q3), CD (Q1, Q3), DE (Q3) | BC (Q1, Q2, Q3)   | 1 hour
    t + 2h | AB (Q2), CD (Q1, Q3), DE (Q3)               | CD (Q1, Q3)          | …
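
    The two schedules can be totted up programmatically. The sketch below is illustrative Python, not from the original article; the job lists mirror Tables 2 and 3, with strategy 1's A-D scan counted as one 3-hour job and C-E as a 2-hour job.

```python
# Strategy 1: monolithic jobs, merged only while pending (Table 2).
strategy1 = [
    ("Q1", "A-D", 3),     # three segments scanned as one job
    ("Q2+Q3", "B-C", 1),  # shared segment for the two pending queries
    ("Q2", "A-B", 1),
    ("Q3", "C-E", 2),
]

# Strategy 2: every query pre-chunked into 1-hour slots (Table 3);
# overlapping slots are executed once and shared between queries.
strategy2 = [
    ("Q1", "A-B", 1),
    ("Q1+Q2+Q3", "B-C", 1),
    ("Q1+Q3", "C-D", 1),
    ("Q2", "A-B", 1),
    ("Q3", "D-E", 1),
]

def total_hours(schedule):
    """Jobs run back to back on a single job tracker."""
    return sum(hours for _, _, hours in schedule)

print(total_hours(strategy1))  # 7
print(total_hours(strategy2))  # 5
```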

    Original URL path: http://www.alberton.info/datasift_historical_twitter_access.html (2016-04-23)


